PCT 



WORLD INTELLECTUAL PROPERTY ORGANIZATION 

International Bureau 




INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification 6 : 

C12Q 1/68, C12P 19/34, C07H 21/04 



A1



(11) International Publication Number: WO 99/09217 

(43) International Publication Date: 25 February 1999 (25.02.99) 



(21) International Application Number: 



PCT/US98/16966 



(22) International Filing Date: 



14 August 1998 (14.08.98) 



(30) Priority Data: 

08/912,885    15 August 1997 (15.08.97)     US
08/947,779    9 October 1997 (09.10.97)     US
08/959,365    28 October 1997 (28.10.97)    US



(71) Applicant (for all designated States except US): HYSEQ, INC. 

[US/US]; 670 Almanor Avenue, Sunnyvale, CA 94086 (US). 

(72) Inventors; and 

(75) Inventors/Applicants (for US only): DRMANAC, Radoje 
[YU/US]; 850 East Greenwich Place, Palo Alto, CA 94303 
(US). DRMANAC, Snezana [YU/US]; 850 East Greenwich 
Place, Palo Alto, CA 94303 (US). BAIDYA, Narayan 
[IN/US]; 966 Helen Avenue #1, Sunnyvale, CA 94086 (US). 

(74) Agents: ABRAMS, Samuel, B. et al.; Pennie & Edmonds LLP, 
1155 Avenue of the Americas, New York, NY 10036 (US). 



(81) Designated States: AL, AM, AU, AZ, BA, BB, BG, BR, BY, 
CA, CN, CU, CZ, EE, GE, HR, HU, ID, IL, IS, JP, KG, 
KP, KR, KZ, LC, LK, LR, LT, LV, MD, MG, MK, MN, 
MX, NO, NZ, PL, RO, RU, SG, SI, SK, SL, TJ, TM, TR, 
TT, UA, US, UZ, VN, YU, ARIPO patent (GH, GM, KE, 
LS, MW, SD, SZ, UG, ZW), Eurasian patent (AM, AZ, BY, 
KG, KZ, MD, RU, TJ, TM), European patent (AT, BE, CH, 
CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, 
PT, SE), OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, 
GW, ML, MR, NE, SN, TD, TG). 



Published 

With international search report. 



(54) Title: METHODS AND COMPOSITIONS FOR DETECTION OR QUANTIFICATION OF NUCLEIC ACID SPECIES 



(57) Abstract 

The present invention provides a method for detecting a 
target nucleic acid species using an array of probes affixed to a 
substrate and a plurality of labeled probes. The invention also 
relates to oligonucleotide probes attached to discrete particles 
wherein the particles can be grouped into a plurality of sets 
based on a physical property. A different probe is attached 
to the discrete particles of each set, and the identity of 
the probe is determined by identifying the discrete particles 
from their physical property. The invention further relates 
to methods using agents which destabilize the binding of 
complementary polynucleotide strands (decrease the binding 
energy) or increase stability of binding between complementary 
polynucleotide strands (increase the binding energy). The figure 
is an illustration of an apparatus for mass producing probe 
arrays. 

METHODS AND COMPOSITIONS FOR DETECTION OR 
QUANTIFICATION OF NUCLEIC ACID SPECIES 



CROSS REFERENCE TO RELATED APPLICATIONS 
This patent application is a continuation-in-part of U.S. Patent application Serial 
No. 08/912,885, filed on August 15, 1997, and a continuation-in-part of U.S. Patent 
application Serial No. 08/947,779, filed on October 9, 1997, and a continuation-in-part of 
U.S. Patent application Serial No. 08/892,503, filed on July 14, 1997, and a 
continuation-in-part of U.S. Patent application Serial No. 08/812,951, filed on March 4, 
1997, and a continuation-in-part of U.S. Patent application Serial No. 08/784,747, filed 
on January 16, 1997. 

FIELD OF THE INVENTION 

This invention relates in general to methods and apparatus for nucleic acid 
analysis, and, in particular, to methods and apparatus for nucleic acid analysis. 

BACKGROUND 

The rate of determining the sequence of the four nucleotides in nucleic acid 
samples is a major technical obstacle for further advancement of molecular biology, 
medicine, and biotechnology. Nucleic acid sequencing methods which involve separation 
of nucleic acid molecules in a gel have been in use since 1978. The other proven method 
for sequencing nucleic acids is sequencing by hybridization (SBH). 

The traditional method of determining a sequence of nucleotides (i.e., the order 
of the A, G, C and T nucleotides in a sample) is performed by preparing a mixture of 
randomly-terminated, differentially labelled nucleic acid fragments by degradation at 
specific nucleotides, or by dideoxy chain termination of replicating strands. Resulting 
nucleic acid fragments in the range of 1 to 500 bp are then separated on a gel to produce 
a ladder of bands wherein the adjacent samples differ in length by one nucleotide. 

The array-based approach of SBH does not require single base resolution in 
separation, degradation, synthesis or imaging of a nucleic acid molecule. Using mismatch 
discriminative hybridization of short oligonucleotides K bases in length, lists of 
constituent K-mer oligonucleotides may be determined for target nucleic acid. Sequence 
for the target nucleic acid may be assembled by uniquely overlapping scored 
oligonucleotides. 
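
The overlap assembly described above can be illustrated with a minimal sketch. It assumes an idealized, error-free hybridization result and a known starting K-mer; the function names and the example target are hypothetical and are not taken from this application.

```python
def kmer_spectrum(seq, k):
    """Return the set of K-mers in seq (an idealized list of positive probes)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assemble(spectrum, start):
    """Extend from a starting K-mer while exactly one unused K-mer overlaps by K-1 bases."""
    k = len(start)
    seq = start
    unused = set(spectrum) - {start}
    while True:
        suffix = seq[-(k - 1):]
        candidates = [p for p in unused if p.startswith(suffix)]
        if len(candidates) != 1:  # none: end of sequence; several: ambiguous overlap
            break
        seq += candidates[0][-1]
        unused.remove(candidates[0])
    return seq


target = "ATGCCGTAAC"            # hypothetical target
spectrum = kmer_spectrum(target, 4)
print(assemble(spectrum, "ATGC"))  # prints ATGCCGTAAC
```

The loop stops as soon as the overlap is not unique, which is the situation in which additional information is needed to continue the assembly.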

There are several approaches available to achieve sequencing by hybridization. In 
a process called SBH Format 1, nucleic acid samples are arrayed, and labeled probes are 
hybridized with the samples. Replica membranes with the same sets of sample nucleic 
acids may be used for parallel scoring of several probes and/or probes may be 
multiplexed. Nucleic acid samples may be arrayed and hybridized on nylon membranes 
or other suitable supports. Each membrane array may be reused many times. Format 1 is 
especially efficient for batch processing large numbers of samples. 

In SBH Format 2, probes are arrayed at locations on a substrate which correspond 
to their respective sequences, and a labelled nucleic acid sample fragment is hybridized to 
the arrayed probes. In this case, sequence information about a fragment may be 
determined in a simultaneous hybridization reaction with all of the arrayed probes. For 
sequencing other nucleic acid fragments, the same oligonucleotide array may be reused. 
The arrays may be produced by spotting or by in situ synthesis of probes. 

In Format 3 SBH, two sets of probes are used. In one embodiment, a set may be 
in the form of arrays of probes with known positions, and another, labelled set may be 
stored in multiwell plates. In this case, target nucleic acid need not be labelled. Target 
nucleic acid and one or more labelled probes are added to the arrayed sets of probes. If 
one attached probe and one labelled probe both hybridize contiguously on the target 
nucleic acid, they are covalently ligated, producing a detected sequence equal to the sum 
of the length of the ligated probes. The process allows for sequencing long nucleic acid 
fragments, e.g. a complete bacterial genome, without nucleic acid subcloning in smaller 
pieces. 
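
As a rough illustration of the Format 3 readout, the sketch below reports a ligation event only when a fixed probe and a labelled probe lie end to end on the target. For simplicity each probe is represented by the target-strand sequence it reads rather than by its complement, hybridization is treated as exact matching, and all sequences and names are hypothetical.

```python
def ligation_events(target, fixed_probes, labeled_probes):
    """Return (fixed, labeled, joined) for every probe pair that is contiguous on the target."""
    events = []
    for fixed in fixed_probes:
        for labeled in labeled_probes:
            joined = fixed + labeled
            if joined in target:  # both probes match and abut, so ligation is possible
                events.append((fixed, labeled, joined))
    return events


target = "GATTACAGGCTA"                  # hypothetical target
fixed = ["GATTA", "ACAGG", "TTTTT"]      # arrayed probes
labeled = ["CAGGC", "CTAAA", "CAGGT"]    # labelled probes in solution

events = ligation_events(target, fixed, labeled)
for f, l, joined in events:
    print(f, l, len(joined))             # prints GATTA CAGGC 10
```

Two 5-mers that ligate give a 10-base read, which is the sum of the probe lengths referred to above.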

In the present invention, SBH is applied to the efficient identification and 
sequencing of one or more nucleic acid samples. The procedure has many applications in 
nucleic acid diagnostics, forensics, and gene mapping. It also may be used to identify 
mutations responsible for genetic disorders and other traits, to assess biodiversity and to 
produce many other types of data dependent on nucleic acid sequence. 

SUMMARY OF THE INVENTION 

The present invention provides a method for detecting a target nucleic acid species 
including the steps of providing an array of probes affixed to a substrate and a plurality of 
labeled probes wherein each labeled probe is selected to have a first nucleic acid sequence 
which is complementary to a first portion of a target nucleic acid and wherein the nucleic 
acid sequence of at least one probe affixed to the substrate is complementary to a second 
portion of the nucleic acid sequence of the target, the second portion being adjacent to the 
first portion; applying a target nucleic acid to the array under suitable conditions for 
hybridization of probe sequences to complementary sequences; introducing a labeled 
probe to the array; hybridizing a probe affixed to the substrate to the target nucleic acid; 
hybridizing the labeled probe to the target nucleic acid; affixing the labeled probe to an 
adjacently hybridized probe in the array; and detecting the labeled probe affixed to the 
probe in the array. According to preferred methods of the invention the array of probes 
affixed to the substrate comprises a universal set of probes. According to other preferred 
aspects of the invention at least two of the probes affixed to the substrate define 
overlapping sequences of the target nucleic acid sequence and more preferably at least 
two of the labelled probes define overlapping sequences of the target nucleic acid 
sequences. Still further, according to another aspect of the invention a method is 
provided for detecting a target nucleic acid of known sequence comprising the steps of: 
contacting a nucleic acid sample with a set of immobilized oligonucleotide probes attached 
to a solid substrate under hybridizing conditions wherein the immobilized probes are 
capable of specific hybridization with different portions of said target nucleic acid 
sequence; contacting the target nucleic acid with a set of labelled oligonucleotide probes 
in solution under hybridizing conditions wherein the labeled probes are capable of specific 
hybridization with different portions of said target nucleic acid sequence adjacent to the 
immobilized probes; covalently joining the immobilized probes to labelled probes that are 
immediately adjacent to the immobilized probe on the target sequence (e.g., with ligase); 
removing any non-ligated labelled probes; detecting the presence of the target nucleic acid 
by detecting the presence of said labelled probe attached to the immobilized probes. The 
invention also provides a method of determining expression of a member of a set of 
partially or completely sequenced genes in a cell type, a tissue or a tissue mixture 
comprising the steps of: defining pairs of fixed and labeled probes specific for the 
sequenced gene; hybridizing unlabeled nucleic acid sample and corresponding labeled 
probes to one or more arrays of fixed probes; forming covalent bonds between adjacent 
hybridized labeled and fixed probes; removing unligated probes; and determining the 
presence of the sequenced gene by detection of labeled probes bound to prespecified 
locations in the array. In a preferred embodiment of this aspect of the invention, the 
target nucleic acid will identify the presence of an infectious agent. 

Further, the present invention provides for an array of oligonucleotide probes 
comprising a nylon membrane; a plurality of subarrays of oligonucleotide probes on the 
nylon membrane, the subarrays comprising a plurality of individual spots wherein each 
spot is comprised of a plurality of oligonucleotide probes of the same sequence; and a 
plurality of hydrophobic barriers located between the subarrays on the nylon membrane, 
whereby the plurality of hydrophobic barriers prevents cross contamination between 
adjacent subarrays. 

Still further, the present invention provides a method for sequencing a repetitive 
sequence, having a first end and a second end, in a target nucleic acid comprising the 
steps of: (a) providing a plurality of spacer oligonucleotides of varying lengths wherein 
the spacer oligonucleotides comprise the repetitive sequence; (b) providing a first 
oligonucleotide that is known to be adjacent to the first end of the repetitive sequence; (c) 
providing a plurality of second oligonucleotides one of which is adjacent to the second 
end of the repetitive sequence, wherein the plurality of second oligonucleotides is labeled; 
(d) hybridizing the first and the plurality of second oligonucleotides, and one of the 
plurality of spacer oligonucleotides to the target nucleic acid; (e) ligating the hybridized 
oligonucleotides; (f) separating ligated oligonucleotides from unligated oligonucleotides; 
and (g) detecting label in the ligated oligonucleotides. 

Still further, the present invention provides a method for sequencing a branch 
point sequence, having a first end and a second end, in a target nucleic acid comprising 
the steps of: (a) providing a first oligonucleotide that is complementary to a first portion 
of the branch point sequence wherein the first oligonucleotide extends from the first end 
of the branch point sequence by at least one nucleotide; (b) providing a plurality of 
second oligonucleotides that are labeled, and are complementary to a second portion of 
the branch point sequence wherein the plurality of second oligonucleotides extend from 
the second end of the branch point sequence by at least one nucleotide, and wherein the 
portion of the second oligonucleotides that extend from the second end of the branch point 
sequence comprise sequences that are complementary to a plurality of sequences that arise 
from the branch point sequence; (c) hybridizing the first oligonucleotide, and one of the 
plurality of second oligonucleotides to the target DNA; (d) ligating the hybridized 
oligonucleotides; (e) separating ligated oligonucleotides from unligated oligonucleotides; 
and (f) detecting label in the ligated oligonucleotides. 

Still further, the present invention provides a method for confirming a sequence by 
using probes that are predicted to be negative for the target nucleic acid. The sequence of 
a target is then confirmed by hybridizing the target nucleic acid to the "negative" probes 
to confirm that these probes do not form perfect matches with the target nucleic acid. 

Still further, the present invention provides a method for analyzing a nucleic acid 
using oligonucleotide probes that are complexed with different labels so that the probes 
may be multiplexed in a hybridization reaction without a loss of sequence information 
(i.e., different probes have different labels so that hybridization of the different probes to 
the target can be distinguished). In a preferred embodiment, the labels are radioisotopes, 
or fluorescent molecules, or enzymes, or electrophore mass labels. In a more preferred 
embodiment, the differently labeled oligonucleotide probes are used in format III SBH, 
and multiple probes (more than two, with one probe being the immobilized probe) are 
ligated together. 

Still further, the present invention provides a method for detecting the presence of 
a target nucleic acid having a known sequence when the target is present in very small 
amounts compared to homologous nucleic acids in a sample. In a preferred embodiment, 
the target nucleic acid is an allele present at very low frequency in a sample that has 
nucleic acids from a large number of sources. In an alternative preferred embodiment, 
the target nucleic acid has a mutated sequence, and is present at very low frequency 
within a sample of nucleic acids. 

Still further, the present invention provides a method for confirming the sequence 
of a target nucleic acid by using single pass gel sequencing. Primers for single pass gel 
sequencing are derived from the sequence obtained by SBH, and these primers are used in 
standard Sanger sequencing reactions to provide gel sequence information for the target 
nucleic acid. The sequence obtained by single pass gel sequencing is then compared to 
the SBH derived sequence to confirm the sequence. 

Still further, the present invention provides a method for solving branch points by 
using single pass gel sequencing. Primers for the single pass gel sequencing reactions are 
identified from the ends of the Sfs obtained after a first round of SBH sequencing, and 
these primers are used in standard Sanger-sequencing reactions to provide gel sequencing 
information through the branch points of the Sfs. Sfs are then aligned by comparing the 
Sanger-sequencing results through the branch points to the Sfs to identify adjoining Sfs. 

Still further, the present invention provides for a method of preparing a sample 
containing target nucleic acids by PCR, without purifying the PCR products prior to the 
SBH reactions. In Format I SBH, crude PCR products are applied to a substrate without 
prior purification, and the substrate may be washed prior to introduction of the labeled 
probes. 

Still further, the present invention provides a method and an apparatus for 
analyzing a target nucleic acid. The apparatus comprises two arrays of nucleic acids that 
are mixed together at the desired time. In a preferred embodiment, the nucleic acids in 
one of the arrays are labeled. In a more preferred embodiment, a material is disposed 
between the two arrays and this material prevents the mixing of nucleic acids in the 
arrays. When this material is removed, or rendered permeable, the nucleic acids in the 
two arrays are mixed together. In an alternative preferred embodiment, the nucleic acids 
in one array are target nucleic acids and the nucleic acids in the other are oligonucleotide 
probes. In another preferred embodiment, the nucleic acids in both arrays are 
oligonucleotide probes. In another preferred embodiment, the nucleic acids in one array 
are oligonucleotide probes and target nucleic acids, and nucleic acids in the other array 
are oligonucleotide probes. In another preferred embodiment, the nucleic acids in both 
arrays are oligonucleotide probes and target nucleic acids. 

One method of the present invention using the apparatus described above 
comprises the steps of providing an array of nucleic acids fixed to a substrate, providing a 
second array of nucleic acids, providing conditions that allow the nucleic acids in the 
second array to come into contact with the nucleic acids of the fixed array wherein one of 
the arrays of nucleic acids are target nucleic acids and the other array is oligonucleotide 
probes, and analyzing the hybridization results. In a preferred embodiment, the fixed 
array is target nucleic acid and the second array is labeled oligonucleotide probes. In a 
more preferred embodiment, there is a material disposed between the two arrays that 
prevents mixing of the nucleic acids until the material is removed or rendered permeable 
to the nucleic acids. 

A second method of the present invention using the apparatus described above 
comprises the steps of providing two arrays of nucleic acid probes, providing conditions 
that allow the two arrays of probes to come into contact with each other and a target 
nucleic acid, ligating together probes that are adjacent on the target nucleic acid, and 
analyzing the results. In a preferred embodiment, the probes in one array are fixed and 
the probes in the other array are labeled. In a more preferred embodiment, there is a 
material disposed between the two arrays that prevents mixing of the probes until the 
material is removed or rendered permeable to the probes. 

Still further, the present invention provides substrates on which arrays of 
oligonucleotide probes are fixed, wherein each probe is separated from its neighboring 
probes by a physical barrier that is resistant to the flow of the sample solution. In a 
preferred embodiment, the physical barrier is made of a hydrophobic material. 

Still further, the present invention provides a method for making the arrays of 
oligonucleotide probes that are separated by physical barriers. In a preferred 
embodiment, a grid is applied to the substrate using an ink-jet head that applies a material 
which reduces the reaction volume of the array. 

Still further, the present invention provides substrates on which oligonucleotides 
are fixed to form a three-dimensional array. The three-dimensional array combines high 
resolution for reading probe results (each level has a relatively low density of probes per 
cm²) with high information content in three dimensional space (multiple levels of 
probes). 

Still further, the present invention provides a substrate to which oligonucleotide 
probes are fixed, wherein the oligonucleotide probes have spacers, and wherein the 
spacers increase the distance between the substrate and the informational portion of the 
oligonucleotide probe (e.g., the portion of the oligonucleotide probe which binds to the 
target and gives sequence information). In a preferred embodiment, the spacer comprises 
ribose sugars and phosphates, wherein the phosphates covalently bind the ribose sugars 
into a polymer by forming esters with the ribose sugars through their 5' and 3' hydroxyl 
groups. 

Still further, the present invention provides a method for clustering cDNA clones 
into groups of similar or identical sequences, so that single representative clones may be 
selected from each group for sequencing. In a preferred embodiment, the method for 
clustering is used in the sequencing of a plurality of clones, comprising the steps of: 
interrogating each clone with a plurality of oligonucleotide probes; determining which 
probes bind to each clone and the signal intensity for each probe; clustering clones into a 
plurality of groups by identifying clones that bind to similar probes with similar 
intensities; and sequencing at least one clone from each group. In a more preferred 
embodiment, the plurality of probes comprises from about 50 to about 500 different 
probes. In another more preferred embodiment, the plurality of probes comprises about 
300 different probes. In a most preferred embodiment, the plurality of clones are a 
plurality of cDNA clones. 
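
The clustering step can be sketched as follows. This is only one possible reading of "similar probes with similar intensities": a greedy grouping by Euclidean distance between intensity vectors. The distance measure, the threshold, the clone names and the intensity values are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two probe-intensity vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(signatures, threshold):
    """Greedy grouping: a clone joins the first group whose representative is within threshold."""
    groups = []  # list of (representative clone, member clones)
    for name, signature in signatures.items():
        for representative, members in groups:
            if distance(signatures[representative], signature) <= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return groups


# Hypothetical signal intensities for four clones scored with four probes.
signatures = {
    "clone1": [0.90, 0.10, 0.80, 0.00],
    "clone2": [0.85, 0.15, 0.75, 0.05],
    "clone3": [0.10, 0.90, 0.00, 0.70],
    "clone4": [0.05, 0.95, 0.10, 0.75],
}
groups = cluster(signatures, threshold=0.3)
representatives = [rep for rep, _ in groups]
print(representatives)  # prints ['clone1', 'clone3']
```

Only the representatives would then be carried forward for sequencing, one per group.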

Still further, the invention relates to oligonucleotide probes complexed (covalent or 
noncovalent) to discrete particles wherein the particles can be grouped into a plurality of 
sets based on a physical property. In a preferred embodiment, a different probe is 
attached to the discrete particles of each set, and the identity of the probe is determined 
by identifying the physical property of the discrete particles. In an alternative 
embodiment, the probe is identified on the basis of a physical property of the probe. The 
physical property includes any that can be used to differentiate the discrete particles, and 
includes, for example, size, fluorescence, radioactivity, electromagnetic charge, or 
absorbance, or label(s) may be attached to the particle such as a dye, a radionuclide, or 
an EML. In a preferred embodiment, discrete particles are separated by a flow cytometer 
which detects the size, charge, fluorescence, or absorbance of the particle. 

The invention also relates to methods using the probes complexed with the discrete 
particles to analyze target nucleic acids. These probes may be used in any of the methods 
described above, with the modification of identifying the probe by the physical property 
of the discrete particle. These probes may also be used in a format III approach where 
the "free" probe is identified by a label, and the probe complexed to the discrete particle 
is identified by the physical property. In a preferred embodiment, the probes are used to 
sequence a target nucleic acid using SBH. 

The invention also relates to methods using agents which destabilize the binding of 
complementary polynucleotide strands (decrease the binding energy), or increase stability 
of binding between complementary polynucleotide strands (increase the binding energy). 
In preferred embodiments, the agent is a trialkyl ammonium salt, sodium chloride, 
phosphate salts, borate salts, organic solvents such as formamide, glycol, 
dimethylsulfoxide, and dimethylformamide, urea, guanidinium, amino acid analogs such 
as betaine, polyamines such as spermidine and spermine, or other positively charged 
molecules which neutralize the negative charge of the phosphate backbone, detergents 
such as sodium dodecyl sulfate and sodium lauryl sarcosinate, minor/major groove 
binding agents, positively charged polypeptides, and intercalating agents such as acridine, 
ethidium bromide, and anthracene. In a preferred embodiment, an agent is used to reduce 
or increase the Tm of a pair of complementary polynucleotides. In a more preferred 
embodiment, a mixture of the agents is used to reduce or increase the Tm of a pair of 
complementary polynucleotides. In a most preferred embodiment, an agent or a mixture 
of agents is used to increase the discrimination of perfect matches from mismatches for 
complementary polynucleotides. In a preferred embodiment, the agent or agents are 
added so that the binding energy from an AT base pair is approximately equivalent to the 
binding energy of a GC base pair. The energy of binding of these complementary 
polynucleotides may be increased by adding an agent that neutralizes or shields the 
negative charges of the phosphate groups in the polynucleotide backbone. 
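
To show why equalizing AT and GC binding energies matters for short probes, the sketch below uses the Wallace rule of thumb (Tm of about 2 °C per A or T plus 4 °C per G or C). That rule is a general approximation for short oligonucleotides and is not taken from this application; the probe sequences are hypothetical.

```python
def wallace_tm(oligo):
    """Rule-of-thumb Tm in degrees C for a short oligonucleotide: 2*(A+T) + 4*(G+C)."""
    at = sum(oligo.count(base) for base in "AT")
    gc = sum(oligo.count(base) for base in "GC")
    return 2 * at + 4 * gc


at_rich = wallace_tm("ATTATAAT")  # eight A/T bases
gc_rich = wallace_tm("GCCGCGGC")  # eight G/C bases
print(at_rich, gc_rich)           # prints 16 32
```

Under this approximation two probes of equal length can differ twofold in estimated Tm, so a single hybridization temperature cannot suit both; an agent that brings AT and GC pairs to similar binding energy would narrow that spread.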

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates a top view of the apparatus for mass producing probe arrays. 
Figure 2 illustrates a side view of the apparatus for mass producing probe arrays. 
Figure 3 is an exploded side view of the dispensing unit of the apparatus for mass 
producing probe arrays. 


DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Format 1 SBH is appropriate for the simultaneous analysis of a large set of 
samples. Parallel scoring of thousands of samples on large arrays may be performed in 
thousands of independent hybridization reactions using small pieces of membranes. The 
identification of DNA may involve 1-20 probes per reaction and the identification of 
mutations may in some cases involve more than 1000 probes specifically selected or 
designed for each sample. For identification of the nature of the mutated DNA segments, 
specific probes may be synthesized or selected for each mutation detected in the first 
round of hybridizations. 

DNA samples may be prepared in small arrays which may be separated by 
appropriate spacers, and which may be simultaneously tested with probes selected from a 
set of oligonucleotides which may be arrayed in multiwell plates. Small arrays may 
consist of one or more samples. DNA samples in each small array may include mutants 
or individual samples of a sequence. Consecutive small arrays may be organized into 
larger arrays. Such larger arrays may include replication of the same small array or may 
include arrays of samples of different DNA fragments. A universal set of probes includes 
sufficient probes to analyze a DNA fragment with prespecified precision, e.g. with 
respect to the redundancy of reading each base pair ("bp"). These sets may include more 
probes than are necessary for one specific fragment, but may include fewer probes than 
are necessary for testing thousands of DNA samples of different sequence. 

DNA or allele identification and a diagnostic sequencing process may include the 
steps of: 

1) Selection of a subset of probes from a dedicated, representative or universal 
set to be hybridized with each of a plurality of small arrays; 

2) Adding a first probe to each subarray on each of the arrays to be analyzed 
in parallel; 

3) Performing hybridization and scoring of the hybridization results; 

4) Stripping off previously used probes; 

5) Repeating hybridization, scoring and stripping steps for the remaining 
probes which are to be scored; 

6) Processing the obtained results to obtain a final analysis or to determine 
additional probes to be hybridized; 

7) Performing additional hybridizations for certain subarrays; and 

8) Processing complete sets of data and obtaining a final analysis. 
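
The hybridize, score and strip cycle listed above can be sketched as a simple loop. This is a toy model under stated assumptions: a probe scores positive when it occurs exactly in the sample, stripping is perfect, and the subarray names, sample sequences and probe list are hypothetical.

```python
def score_cycles(subarrays, probes):
    """Score each probe on every subarray, one hybridization cycle per probe."""
    results = {name: {} for name in subarrays}
    for probe in probes:                        # add the probe and hybridize
        for name, sample in subarrays.items():  # score every subarray in parallel
            results[name][probe] = probe in sample
        # stripping: nothing from this cycle carries into the next one
    return results


def needs_followup(results, reference):
    """Subarrays whose scores differ from the reference and need additional probes."""
    return sorted(name for name, scores in results.items()
                  if scores != results[reference])


subarrays = {
    "standard": "ATGCCGTAAC",
    "sample_1": "ATGCCGTAAC",
    "sample_2": "ATGCAGTAAC",   # differs from the standard at one base
}
probes = ["ATGC", "GCCG", "GTAA"]

results = score_cycles(subarrays, probes)
followup = needs_followup(results, "standard")
print(followup)  # prints ['sample_2']
```

Only the subarrays flagged here would receive the additional hybridizations before the final analysis.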

This approach provides fast identification and sequencing of a small number of 
nucleic acid samples of one type (e.g. DNA, RNA), and also provides parallel analysis of 
many sample types in the form of subarrays by using a presynthesized set of probes of 
manageable size. Two approaches have been combined to produce an efficient and 
versatile process for the determination of DNA identity, for DNA diagnostics, and for 
identification of mutations. 

For the identification of known sequences, a small set of shorter probes may be 
used in place of a longer unique probe. In this approach, although there may be more 
probes to be scored, a universal set of probes may be synthesized to cover any type of 
sequence. For example, a full set of 6-mers includes only 4,096 probes, and a complete 
set of 7-mers includes only 16,384 probes. 
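
The set sizes quoted above follow from four possible bases at each of K positions, i.e. 4^K probes. A short sketch that enumerates such a set (the function name is illustrative):

```python
from itertools import product


def universal_set(k):
    """Enumerate every K-mer over the four bases A, C, G and T."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


print(len(universal_set(6)))  # prints 4096
print(len(universal_set(7)))  # prints 16384
```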

Full sequencing of a DNA fragment may be performed with two levels of 
hybridization. One level is hybridization of a sufficient set of probes that cover every base 
at least once. For this purpose, a specific set of probes may be synthesized for a standard 
sample. The results of hybridization with such a set of probes reveal whether and where 
mutations (differences) occur in non-standard samples. Further, this set of probes may 
include "negative" probes to confirm the hybridization results of the "positive" probes. 
To determine the identity of the changes, additional specific probes may be hybridized to 
the sample. This additional set of probes will have both "positive" (the mutant sequence) 
and "negative" probes, and the sequence changes will be identified by the positive probes 
and confirmed by the negative probes. 

In another embodiment, all probes from a universal set may be scored. A 
universal set of probes allows scoring of a relatively small number of probes per sample 
in a two step process without an undesirable expenditure of time. The hybridization 
process may involve successive probings, in a first step of computing an optimal subset of 
probes to be hybridized first and, then, on the basis of the obtained results, a second step 
of determining additional probes to be scored from among those in a universal set. Both 
sets of probes have "negative" probes that confirm the positive probes in the set. 
Further, the sequence that is obtained may then be confirmed in a separate step by 
hybridizing the sample with a set of "negative" probes identified from the SBH results. 

In SBH sequence assembly, K-1 oligonucleotides which occur repeatedly in 
analyzed DNA fragments due to chance or biological reasons may be subject to special 
consideration. If there is no additional information, relatively small fragments of DNA 
may be fully assembled inasmuch as every base pair is read several times. 

In the assembly of relatively longer fragments, ambiguities may arise due to the 
repeated occurrence in a set of positively-scored probes of a K-1 sequence (i.e., a 
sequence shorter than the length of the probe). This problem does not exist if mutated or 
similar sequences have to be determined (i.e., the K-1 sequence is not identically 
repeated). Knowledge of one sequence may be used as a template to correctly assemble a 
sequence known to be similar (e.g. by its presence in a database) by arraying the positive 
probes for the unknown sequence to display the best fit on the template. 
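
The ambiguity described above can be located by looking for (K-1)-base prefixes that are followed by more than one base among the positive probes. The sketch below is a minimal illustration with a hypothetical probe set; it does not resolve the ambiguity, it only reports where assembly would branch.

```python
from collections import defaultdict


def branch_points(positive_probes):
    """Map each (K-1)-mer prefix with several possible next bases to those bases."""
    k = len(next(iter(positive_probes)))
    successors = defaultdict(set)
    for probe in positive_probes:
        successors[probe[:k - 1]].add(probe[-1])
    return {prefix: sorted(bases)
            for prefix, bases in successors.items() if len(bases) > 1}


# Positive 4-mers for the hypothetical fragment GACGTTACGA, in which ACG occurs twice.
positives = {"GACG", "ACGT", "CGTT", "GTTA", "TTAC", "TACG", "ACGA"}
print(branch_points(positives))  # prints {'ACG': ['A', 'T']}
```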

The use of an array of samples avoids consecutive scoring of many oligonucleotides 
on a single sample or on a small set of samples. This approach allows the scoring of 
more probes in parallel by manipulation of only one physical object. Subarrays of DNA 
samples 1000 bp in length may be sequenced in a relatively short period of time. If the 
samples are spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 
probes may be scored. In screening for the occurrence of a mutation, enough probes may 
be used to cover each base three times. If a mutation is present, several covering probes 
will be affected. The use of information about the identity of negative probes may map 
the mutation with a two base precision. To solve a single base mutation mapped in this 

15 way, an additional 15 probes may be employed. These probes cover any base combination 
for two questionable positions (assuming that deletions and insertions are not involved). 
These probes may be scored in one cycle on 50 subarrays which contain a given sample. 
In the implementation of a multiple label color scheme (i.e., multiplexing), two to six 
probes, each having a different label such as a different fluorescent dye, may be used as a 

20 pool, thereby reducing the number of hybridization cycles and shortening the sequencing 
process. 
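
The probe counts quoted in this paragraph follow from simple arithmetic, which may be sketched as follows. The function names are illustrative only and do not appear in the disclosure. The count of 15 is read here as the 16 possible base combinations at two positions less the one combination already known from the reference sequence, and the pooling calculation assumes that every subarray receives one pool of differently labeled probes per cycle.

```python
import math


def probes_scored(subarrays: int, reprobings: int, labels_per_pool: int = 1) -> int:
    """Probes scored on one sample when each subarray receives one pool per cycle."""
    return subarrays * reprobings * labels_per_pool


def substitution_probes(positions: int) -> int:
    """Base combinations at the questionable positions, excluding the known one."""
    return 4 ** positions - 1


def cycles_needed(probes: int, subarrays: int, labels_per_pool: int = 1) -> int:
    """Hybridization cycles needed to score a given number of probes."""
    return math.ceil(probes / (subarrays * labels_per_pool))


print(probes_scored(50, 10))       # 50 subarrays reprobed 10 times
print(substitution_probes(2))      # two questionable positions
print(cycles_needed(500, 50, 4))   # pools of four differently labeled probes
```

Under these assumptions, pooling four labels would reduce the ten cycles of the unpooled example to three.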

In more complicated cases, there may be two close mutations or insertions. They
may be handled with more probes. For example, a three base insertion may be solved
with 64 probes. The most complicated cases may be approached by several steps of
hybridization, and the selecting of a new set of probes on the basis of results of previous
hybridizations.

If subarrays to be analyzed include tens or hundreds of samples of one type, then 
several of them may be found to contain one or more changes (mutations, insertions, or 
deletions). For each segment where mutation occurs, a specific set of probes may be 
scored. The total number of probes to be scored for a type of sample may be several
hundred. The scoring of replica arrays in parallel facilitates scoring of hundreds of
probes in a relatively small number of cycles. In addition, compatible probes may be
pooled. Positive hybridizations may be assigned to the probes selected to check particular
DNA segments because these segments usually differ in 75% of their constituent bases. 

By using a larger set of longer probes, longer targets may be analyzed. These 
targets may represent pools of fragments such as pools of exon clones.

A specific hybridization scoring method may be employed to define the presence
of mutants in a genomic segment to be sequenced from a diploid chromosomal set. Two
variations are where: i) the sequence from one chromosome represents a known allele and
the sequence from the other represents a new mutant; or, ii) both chromosomes contain
new, but different mutants. In both cases, the scanning step designed to map changes
gives a maximal signal difference of two-fold at the mutant position. Further, the method
can be used to identify which alleles of a gene are carried by an individual and whether 
the individual is homozygous or heterozygous for that gene. 

Scoring two-fold signal differences required in the first case may be achieved
efficiently by comparing corresponding signals with homozygous and heterozygous
controls. This approach allows determination of a relative reduction in the hybridization
signal for each particular probe in a given sample. This is significant because
hybridization efficiency may vary more than two-fold for a particular probe hybridized
with different nucleic acid fragments having the same full match target. In addition,
different mutant sites may affect more than one probe depending upon the number of
oligonucleotide probes. Decrease of the signal for two to four consecutive probes
produces a more significant indication of a mutant site. Results may be checked by
testing with small sets of selected probes, among which one or a few probes are selected
to give a full match signal which is on average eight-fold stronger than the signals
coming from mismatch-containing duplexes.
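
The comparison against a homozygous control may be sketched as follows. This is a minimal illustration and not the scoring method of the disclosure: the function name, the ratio threshold of 0.6 and the minimum run length of two probes are assumptions chosen only to show how a reduction of about two-fold over consecutive overlapping probes might be flagged.

```python
def reduced_runs(sample, control, max_ratio=0.6, min_run=2):
    """Return (start, end) index ranges of consecutive probes whose signal,
    relative to the homozygous control, is at or below max_ratio."""
    n = min(len(sample), len(control))
    runs, start = [], None
    for i in range(n):
        low = control[i] > 0 and sample[i] / control[i] <= max_ratio
        if low and start is None:
            start = i                      # a run of reduced signals begins
        elif not low and start is not None:
            if i - start >= min_run:       # keep only runs of enough probes
                runs.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_run:
        runs.append((start, n - 1))        # run extends to the last probe
    return runs


control = [100] * 8                        # hypothetical homozygous control signals
sample = [98, 102, 51, 49, 52, 97, 45, 101]
print(reduced_runs(sample, control))
```

In this made-up data the three consecutive probes at indices 2 to 4 are reported, while the isolated drop at index 6 is ignored as a single-probe fluctuation.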

Partitioned membranes allow a very flexible organization of experiments to
accommodate relatively larger numbers of samples representing a given sequence type, or
many different types of samples represented with relatively small numbers of samples. A
range of 4-256 samples can be handled with particular efficiency. Subarrays within this
range of numbers of dots may be designed to match the configuration and size of standard
multiwell plates used for storing and labeling oligonucleotides. The size of the subarrays
may be adjusted for different numbers of samples, or a few standard subarray sizes may be
used. If all samples of a type do not fit in one subarray, additional subarrays or
membranes may be used and processed with the same probes. In addition, by adjusting
the number of replicas for each subarray, the time for completion of the identification or
sequencing process may be varied.

As used herein, "intermediate fragment" means an oligonucleotide between 5 and
1000 bases in length, and preferably between 10 and 40 bp in length.

In Format 3, a first set of oligonucleotide probes of known sequence is 
immobilized on a solid support under conditions which permit them to hybridize with 
nucleic acids having respectively complementary sequences. A labeled, second set of 
oligonucleotide probes is provided in solution. Both within the sets and between the sets 
the probes may be of the same length or of different lengths. A nucleic acid to be
sequenced or intermediate fragments thereof may be applied to the first set of probes in
double-stranded form (especially where a recA protein is present to permit hybridization
under non-denaturing conditions), or in single-stranded form and under conditions which
permit hybrids of different degrees of complementarity (for example, under conditions
which allow discrimination between full match and one base pair mismatch hybrids). The
nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first
set of probes before, after or simultaneously with the second set of probes. Probes that
bind to adjacent sites on the target are bound together (e.g., by stacking interactions or by
a ligase or other means of causing chemical bond formation between the adjacent
probes). After permitting adjacent probes to be bound, fragments and probes which are
not immobilized to the surface by chemical bonding to a member of the first set of probes
are washed away, for example, using a high temperature (up to 100 degrees C) wash
solution which melts hybrids. The bound probes from the second set may then be
detected using means appropriate to the label employed (which may, for example, be
chemiluminescent, fluorescent, radioactive, enzymatic, densitometric, or electrophore
mass labels).

Herein, nucleotide bases "match" or are "complementary" if they form a stable 
duplex by hydrogen bonding under specified conditions. For example, under conditions 
commonly employed in hybridization assays, adenine ("A") matches thymine ("T"), but 
not guanine ("G") or cytosine ("C"). Similarly, G matches C, but not A or T. Other
bases which will hydrogen bond in less specific fashion, such as inosine or the Universal
Base ("M" base, Nichols et al., 1994), or other modified bases, such as methylated bases,
for example, are complementary to those bases with which they form a stable duplex under
specified conditions. A probe is said to be "perfectly complementary" or is said to be a
"perfect match" if each base in the probe forms a duplex by hydrogen bonding to a base
in the nucleic acid to be sequenced according to the Watson and Crick base pairing rules
(i.e., absent any surrounding sequence effects, the duplex formed has the maximal
binding energy for a particular probe). "Perfectly complementary" and "perfect match"
are also meant to encompass probes which have analogs or modified nucleotides. A
"perfect match" for an analog or modified nucleotide is judged according to a "perfect
match rule" selected for that analog or modified nucleotide (e.g., the binding pair that has
maximal binding energy for a particular analog or modified nucleotide). Each base in a
probe that does not form a binding pair according to the "rules" is said to be a
"mismatch" under the specified hybridization conditions.

A list of probes may be assembled wherein each probe is a perfect match to the 
nucleic acid to be sequenced. The probes on this list may then be analyzed to order them 

in maximal overlap fashion. Such ordering may be accomplished by comparing a first
probe to each of the other probes on the list to determine which probe has a 3' end
which has the longest sequence of bases identical to the sequence of bases at the 5' end of
a second probe. The first and second probes may then be overlapped, and the process
may be repeated by comparing the 5' end of the second probe to the 3' end of all of the
remaining probes and by comparing the 3' end of the first probe with the 5' end of all of
the remaining probes. The process may be continued until there are no probes on the list
which have not been overlapped with other probes. Alternatively, more than one probe
may be selected from the list of positive probes, and more than one set of overlapped
probes ("sequence nucleus") may be generated in parallel. The list of probes for either
such process of sequence assembly may be the list of all probes which are perfectly
complementary to the nucleic acid to be sequenced or may be any subset thereof.
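
The maximal overlap ordering described above may be sketched as a greedy procedure. This is an illustrative simplification, not the assembly program of the disclosure: the function names are hypothetical, a single sequence nucleus is grown from the first probe on the list, and no attempt is made to detect branch points, so the sketch is only reliable when every K-1 overlap occurs once in the target.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest 3' end of a that is identical to the 5' end of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def assemble(probes: list[str]) -> str:
    """Grow one sequence nucleus in both directions by maximal overlap."""
    remaining = list(dict.fromkeys(probes))   # drop duplicates, keep order
    contig = remaining.pop(0)
    while remaining:
        best = None                           # (overlap length, probe, side)
        for p in remaining:
            for n, side in ((overlap(contig, p), "right"),
                            (overlap(p, contig), "left")):
                if n > 0 and (best is None or n > best[0]):
                    best = (n, p, side)
        if best is None:
            break                             # nothing overlaps: a gap was reached
        n, p, side = best
        contig = contig + p[n:] if side == "right" else p[:-n] + contig
        remaining.remove(p)
    return contig


positive = ["CCGTA", "ATGCC", "TAAGT", "GCCGT", "GTAAG", "TGCCG", "CGTAA"]
print(assemble(positive))
```

The seven positive 5-mers in the usage lines are a made-up example; presented in arbitrary order, they are reassembled into the 11-base sequence from which they were derived.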

The 5' and 3' ends of the probes may be overlapped to generate longer stretches 
of sequence. This process of assembling probes continues until an ambiguity arises 
because of a branch point (a probe is repeated in the fragment), repetitive sequences
longer than the probes, or an uncloned segment. The stretches of sequence between any
two ambiguities are referred to as fragments of a subclone sequence (Sfs). Where
ambiguities arise in sequence assembly due to the availability of alternative proper
overlaps with probes, hybridization with longer probes spanning the site of overlap
alternatives, competitive hybridization, ligation of alternative end to end pairs of probes 
spanning the site of ambiguity or single pass gel analysis (to provide an unambiguous 
ordering of Sfs) may be used. 

By employing the above procedures, one may obtain any desired level of 
sequence, from a pattern of hybridization (which may be correlated with the identity of a 
nucleic acid sample to serve as a signature for identifying the nucleic acid sample) to 
overlapping or non-overlapping probes up through assembled Sfs and on to complete 
sequence for an intermediate fragment or an entire source DNA molecule (e.g. a 
chromosome). 

Sequencing may generally comprise the following steps: 

(a) contacting an array of immobilized oligonucleotide probes with a nucleic 
acid fragment under conditions effective to allow the fragment to form a primary complex 
with an immobilized probe having a complementary sequence; 

(b) contacting this primary complex with a set of labeled oligonucleotide 
probes in solution under conditions effective to allow the primary complex to hybridize to 
the labeled probe, thereby forming secondary complexes wherein the fragment is 
hybridized with both an immobilized probe and a labeled probe; 

(c) removing from a secondary complex any labeled probe that has not 
hybridized adjacent to an immobilized probe; 

(d) detecting the presence of adjacent labeled and unlabeled probes by detecting 
the presence of the label; and 

(e) determining a nucleotide sequence of the fragment by connecting the known 
sequence of the immobilized and labeled probes. 

Hybridization and washing conditions may be selected to detect substantially 
perfect match hybrids (such as those wherein the fragment and probe hybridize at six out 
of seven positions), may be selected to allow differentiation of perfect matches and one 
base pair mismatches, or may be selected to permit detection only of perfect match 
hybrids. 
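
The distinction between perfect match hybrids and hybrids with one mismatch (for example, pairing at six out of seven positions) may be illustrated by counting mismatches under the Watson and Crick rules. The sketch below is only an illustration: the function name is hypothetical, both sequences are assumed to be written 5' to 3', and treating "M" as pairing with every base is a simplifying assumption about the universal base.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def mismatches(probe: str, target_site: str, universal=frozenset({"M"})) -> int:
    """Count probe positions that do not pair with the antiparallel target site."""
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have the same length")
    aligned = target_site[::-1]            # antiparallel alignment
    return sum(
        1
        for p, t in zip(probe, aligned)
        if p not in universal and (p, t) not in WATSON_CRICK
    )


print(mismatches("ACGTT", "AACGT"))        # perfect match
print(mismatches("ACGTT", "AACGA"))        # one mismatch at the probe's 5' end
print(mismatches("MCGTT", "AACGA"))        # universal base at that position
```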

Suitable hybridization conditions may be routinely determined by optimization 
procedures or pilot studies. Such procedures and studies are routinely conducted by those 
skilled in the art to establish protocols for use in a laboratory. See, e.g., Ausubel et al.,
Current Protocols in Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); Sambrook
et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring
Harbor Press (1989); and Maniatis et al., Molecular Cloning: A Laboratory Manual,
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York (1982), all of which are
incorporated by reference herein. For example, conditions such as temperature,
concentration of components, hybridization and washing times, buffer components, and
their pH and ionic strength may be varied.

In embodiments wherein the labeled and immobilized probes are not physically or
chemically linked, detection may rely solely on washing steps of controlled stringency.
Under such conditions, adjacent probes have increased binding affinity because of
stacking interactions between the adjacent probes. Conditions may be varied to optimize
the process as described above.

In embodiments wherein the immobilized and labeled probes are ligated, ligation
may be implemented by a chemical ligating agent (e.g., water-soluble carbodiimide or
cyanogen bromide), or a ligase enzyme, such as the commercially available T4 DNA
ligase, may be employed. The washing conditions may be selected to distinguish between
adjacent versus nonadjacent labeled and immobilized probes, exploiting the difference in
stability for adjacent probes versus nonadjacent probes.

Oligonucleotide probes may be labeled with fluorescent dyes, chemiluminescent
systems, radioactive labels (e.g., 35S, 3H, 32P or 33P) or with isotopes detectable by mass
spectrometry.

Where a nucleic acid molecule of unknown sequence is longer than about 45 or 50 
bp, the molecule may be fragmented and the sequences of the fragments determined. 
Fragmentation may be accomplished by restriction enzyme digestion, shearing or NaOH. 
Fragments may be separated by size (e.g., by gel electrophoresis) to obtain a preferred
fragment length of about ten to forty bp.

Oligonucleotides may be immobilized by a number of methods known to those
skilled in the art, such as laser-activated photodeprotection attachment through a
phosphate group using reagents such as a nucleoside phosphoramidite or a nucleoside
hydrogen phosphonate. Glass, nylon, silicon and fluorocarbon supports may be used.

Oligonucleotides may be organized into arrays, and these arrays may include all or 
a subset of all probes of a given length, or sets of probes of selected lengths.
Hydrophobic partitions may be used to separate probes or subarrays of probes. Arrays
may be designed for various applications (e.g., mapping, partial sequencing, sequencing of
targeted regions for diagnostic purposes, mRNA sequencing and large scale sequencing).
A specific chip may be designed to be dedicated to a particular application by selecting a
combination and arrangement of probes on a substrate.

For example, 1024 immobilized probe arrays of all oligonucleotide probes 5 bases
in length (each array containing 1024 distinct probes) may be constructed. The probes in
this example are 5-mers in an informational sense (they may actually be longer probes).
A second set of 1024 5-mer probes may be labeled, and one of each labeled probe may be
applied to an array of immobilized probes along with a fragment to be sequenced. In
this example, 1024 arrays would be combined in a large superarray, or "superchip." In
those instances where an immobilized probe and one of the labeled probes hybridize
end-to-end along a nucleic acid fragment, the two probes are joined, for example by
ligation, and, after removing unbound label, 10-mers complementary to the sample
fragment are detected by the correlation of the presence of a label at a point in an array
having an immobilized probe of known sequence to which was applied a labeled probe of
known sequence. The sequence of the sample fragment is simply the sequence of the
immobilized probe continued in the sequence of the labeled probe. In this way, all one
million possible 10-mers may be tested by a combinatorial process which employs only
5-mers and which thus involves one thousandth of the amount of effort for oligonucleotide
synthesis.
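
The combinatorial bookkeeping of the superchip example may be sketched as follows. The names are illustrative and not part of the disclosure, and strand orientation is ignored: a positive spot is simply read as the immobilized 5-mer continued by the labeled 5-mer, as stated above.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k: int) -> list[str]:
    """Every oligonucleotide of length k, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def read_10mer(immobilized: str, labeled: str) -> str:
    """Sequence read from a positive spot: fixed probe continued by labeled probe."""
    return immobilized + labeled


five_mers = all_kmers(5)
n_arrays = len(five_mers)                  # one array per labeled 5-mer
n_tests = len(five_mers) * n_arrays        # every (immobilized, labeled) pair
n_synthesized = 2 * len(five_mers)         # one fixed set plus one labeled set

print(n_arrays, n_tests, n_synthesized)
print(read_10mer("ACGTA", "CCGTT"))
```

The pair count equals the number of possible 10-mers (4 to the tenth power, about one million), while only two sets of 1024 5-mers need to be synthesized.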

In a preferred embodiment, the substrate which supports the array of 
oligonucleotide probes is partitioned into sections so that each probe in the array is 
separated from adjacent probes by a physical barrier which may be, for example, a 
hydrophobic material. In a preferred embodiment, the physical barrier has a width of
from 100 µm to 30 µm. In a more preferred embodiment, the distance from the center of
each probe to the center of any adjacent probes is 325 µm. These arrays of probes may
be "mass-produced" using a nonmoving, fixed substrate or a substrate fixed to a rotating
drum or plate with an ink-jet deposition apparatus, for example, a microdrop dosing head;
and a suitable robotic system, for example, an Anorad gantry.

In an alternative preferred embodiment, the oligonucleotide probes are fixed to a 
three-dimensional array. The three-dimensional array is comprised of multiple layers,
and each layer may be analyzed separate and apart from the other layers. The
three-dimensional array may take a number of forms. For example, the array may be
disposed on a substrate having multiple depressions with probes located at different depths
within the depressions (each level is made up of probes at similar depths within the
depression); or the array may be disposed on a substrate having depressions of different
depths with the probes located at the bottom of the depression, or at the peaks separating 
the depressions or some combination of peaks and depressions may be used (each level is 
made up of all the probes at a certain depth); or the array may be disposed on a substrate 
comprised of multiple sheets that are layered to form a three-dimensional array. 

The probes in these arrays may include spacers that increase the distance between
the surface of the substrate and the informational portion of the probes. The spacers may
be comprised of atoms capable of forming at least two covalent bonds such as carbon,
silicon, oxygen, sulfur, phosphorus, and the like, or may be comprised of molecules
capable of forming at least two covalent bonds such as sugar-phosphate groups, amino
acids, peptides, nucleosides, nucleotides, sugars, carbohydrates, aromatic rings,
hydrocarbon rings, linear and branched hydrocarbons, and the like.

A nucleic acid sample to be sequenced may be fragmented or otherwise treated
(for example, by the use of recA) to avoid hindrance to hybridization from secondary
structure in the sample. The sample may be fragmented by, for example, digestion with
a restriction enzyme such as CviJI, physical shearing (e.g., by ultrasound), or by
NaOH treatment. The resulting fragments may be separated by gel electrophoresis and
fragments of an appropriate length, such as between about 10 bp and about 40 bp, may be
extracted from the gel. In a preferred embodiment, the "fragments" of the nucleic acid
sample cannot be ligated to other fragments in the pool. Such a pool of fragments may
be obtained by treating the fragmented nucleic acids with a phosphatase (e.g., calf
intestinal phosphatase). Alternatively, nonligatable fragments of the sample nucleic acid
may be obtained by using random primers (e.g., N5-N9, where N = A, G, T, or C) in a
Sanger-dideoxy sequencing reaction with the sample nucleic acid. This will produce
fragments of DNA that have a complementary sequence to the target nucleic acid and that
are terminated in a dideoxy residue that cannot be ligated to other fragments.

A reusable Format 3 SBH array may be produced by introducing a cleavable bond 
between the fixed and labeled probes and then cleaving this bond after a round of Format
3 analysis is finished. The labeled probes may be ribonucleotides or a ribonucleotide
may be used as the joining base in the labeled probe so that this probe may subsequently
be removed, e.g., by RNAse or uracil-DNA glycosylase treatment, or NaOH treatment.
In addition, bonds produced by chemical ligation may be selectively cleaved. 

Other variations include the use of modified oligonucleotides to increase specificity
or efficiency, and cycling hybridizations to increase the hybridization signal, for example by
performing a hybridization cycle under conditions (e.g., temperature) optimally selected
for a first set of labeled probes followed by hybridization under conditions optimally
selected for a second set of labeled probes. Shifts in reading frame may be determined
by using mixtures (preferably mixtures of equimolar amounts) of probes ending in each
of the four nucleotide bases A, T, C and G.

Branch points produce ambiguities as to the ordered sequence of a fragment.
Although the sequence information is determined by SBH, either: (i) long read length,
single-pass gel sequencing at a fraction of the cost of complete gel sequencing; or (ii)
comparison to related sequences, may be used to order hybridization data where such
ambiguities ("branch points") occur. Primers for single-pass gel sequencing through the
branch points are identified from the SBH sequence information or from known vector
sequences, e.g., the flanking sequences to the vector insert site, and standard Sanger
sequencing reactions are performed on the sample nucleic acid. The sequence obtained
from this single-pass gel sequencing is compared to the Sfs that read into and out of the
branch points to identify the order of the Sfs. Alternatively, the Sfs may be ordered by
comparing the sequence of the Sfs to related sequences and ordering the Sfs to produce a
sequence that is closest to the related sequence.

In addition, the number of tandem repetitive nucleic acid segments in a target
fragment may be determined by single-pass gel sequencing. As tandem repeats occur
rarely in protein-encoding portions of a gene, the gel-sequencing step will be performed 
only when one of these noncoding regions is identified as being of particular interest 
(e.g., if it is an important regulatory region). 

Obtaining information about the degree of hybridization exhibited for a set of only
about 200 oligonucleotide probes (about 5% of the effort required for complete
sequencing) defines a unique signature of each gene and may be used for sorting the
cDNAs from a library to determine if the library contains multiple copies of the same
gene. By such signatures, identical, similar and different cDNAs can be distinguished
and inventoried. 

Nucleic acids and methods for isolating, cloning and sequencing nucleic acids are 
well known to those of skill in the art. See, e.g., Ausubel et al., Current Protocols in
Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); and Sambrook et al., Molecular
Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989),
both of which are incorporated by reference herein.

SBH is a well developed technology that may be practiced by a number of 
methods known to those skilled in the art. Specifically, techniques related to sequencing 

by hybridization of the following documents are incorporated by reference herein:
Drmanac et al., U.S. Patent No. 5,202,231 (hereby incorporated by reference herein),
issued April 13, 1993; Drmanac et al., Genomics, 4, 114-128 (1989); Drmanac et al.,
Proceedings of the First Int'l. Conf. Electrophoresis Supercomputing Human Genome,
Cantor et al. eds., World Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al.,
Science, 260, 1649-1652 (1993); Lehrach et al., Genome Analysis: Genetic and Physical
Mapping, 1, 39-81 (1990), Cold Spring Harbor Laboratory Press; Drmanac et al., Nucl.
Acids Res., 4691 (1986); Stevanovic et al., Gene, 79, 139 (1989); Panusku et al., Mol.
Biol. Evol., 1, 607 (1990); Nizetic et al., Nucl. Acids Res., 19, 182 (1991); Drmanac et
al., J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al., Mol. Gen., 4, 125-132
(1991); Strezoska et al., Proc. Nat'l. Acad. Sci. (USA), 88, 10089 (1991); Drmanac et
al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et al., Int. J. Genome Res., 1,
59-79 (1992). 

The present invention is illustrated in the following examples. Upon consideration 
of the present disclosure, one of skill in the art will appreciate that many other
embodiments and variations may be made within the scope of the present invention.
Accordingly, it is intended that the broader aspects of the present invention not be limited
to the disclosure of the following examples.

EXAMPLE 1

Preparation of Sets of Probes 

Two types of universal sets of probes may be prepared. The first is a complete set 
(or at least a noncomplementary subset) of relatively short probes, for example all 4096 
(or about 2000 non-complementary) 6-mers, or all 16,384 (or about 8,000
non-complementary) 7-mers. Full noncomplementary subsets of 8-mers and longer probes
are less convenient inasmuch as they include 32,000 or more probes.
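
The sizes of the noncomplementary subsets may be checked by enumeration. The sketch below assumes that "noncomplementary" means keeping one member of each pair formed by a probe and its reverse complement (a self-complementary probe is its own partner and is kept once); the function names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def noncomplementary_subset(k: int) -> list[str]:
    """One representative from each pair {probe, reverse complement}."""
    kept = []
    for p in product("ACGT", repeat=k):
        s = "".join(p)
        if s <= reverse_complement(s):     # keep the lexicographically smaller member
            kept.append(s)
    return kept


for k in (6, 7, 8):
    print(k, 4 ** k, len(noncomplementary_subset(k)))
```

Under this assumption the subsets contain 2080 6-mers, 8192 7-mers and 32,896 8-mers, in line with the approximate figures given above.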

A second type of probe set is selected as a small subset of probes still sufficient 
for reading every bp in any sequence with at least one probe. For example, 12 of 16
dimers are sufficient. A small subset of 7-mers, 8-mers and 9-mers for sequencing double
stranded DNA may be about 3000, 10,000 and 30,000 probes, respectively.

Sets of probes may also be selected to identify a target nucleic acid of known 
sequence, and/or to identify alleles or mutants of a target nucleic acid with a known 
sequence. Such a set of probes contains sufficient probes so that every nucleotide 
position of the target nucleic acid is read at least once. Alleles or mutants are identified
by the loss of binding of one of the "positive" probes. The specific sequence of these 
alleles or mutants is then determined by interrogating the target nucleic acid with sets of 
probes that contain every possible nucleotide change and combination of changes at these 
probe positions. 

Sets of probes may also be comprised of from 50 probes to a universal set of
probes (all probes of a certain length); more preferably the set is comprised of 100-500
probes, and in a most preferred embodiment, the probe set contains 300 probes. In a
preferred embodiment, the probes in the set are 6-9 nucleotides in length, and are used to
cluster cDNA clones into groups of similar or identical sequences, so that single
representative clones may be selected from each group for sequencing.

Probes may be prepared using standard chemistry with one to three non-specified 
(mixed A, T, C and G) or universal (e.g., M base or inosine) bases at the ends. If
radiolabelling is used, probes may have an OH group at the 5' end for kinasing by
radiolabeled phosphorus groups. Alternatively, probes labelled with any compatible
system, such as fluorescent dyes, may be employed. Other types of probes, such as PNA
(peptide nucleic acids) or probes containing modified bases which change duplex stability
also may be used.

Probes may be stored in bar-coded multiwell plates. For small numbers of probes,
96-well plates may be used; for 10,000 or more probes, storage in 384- or 864-well
plates is preferred. Stacks of 5 to 50 plates are enough to store all probes. Approximately
5 pg of a probe may be sufficient for hybridization with one DNA sample. Thus, from a
small synthesis of about 50 µg per probe, ten million samples may be analyzed. If each
probe is used for every third sample, and if each sample is 1000 bp in length, then over
30 billion bases (10 human genomes) may be sequenced by a set of 5,000 probes.
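
The throughput estimate may be reproduced with a few lines of arithmetic. The sketch assumes a synthesis scale of 50 µg per probe, which is the scale at which 5 pg per hybridization yields the stated ten million samples, and a human genome of about 3 billion bases; the variable and function names are illustrative.

```python
PG_PER_UG = 1_000_000                      # picograms per microgram


def hybridizations_per_synthesis(synthesis_ug: int, pg_per_hybridization: int) -> int:
    """Number of hybridizations one synthesis of a probe can supply."""
    return synthesis_ug * PG_PER_UG // pg_per_hybridization


per_probe = hybridizations_per_synthesis(50, 5)   # assumed 50 ug scale, 5 pg per use
samples = per_probe * 3                           # each probe used on every third sample
bases = samples * 1000                            # 1000 bp per sample
genomes = bases / 3_000_000_000                   # assumed genome size of about 3 Gb

print(per_probe, samples, bases, genomes)
```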

EXAMPLE 2

Probes Having Modified Oligonucleotides

Modified oligonucleotides may be introduced into hybridization probes and used 
under appropriate conditions therefor. For example, pyrimidines with a halogen at the
C5-position may be used to improve duplex stability by influencing base stacking.
2,6-diaminopurine may be used to provide a third hydrogen bond in base pairing with
thymine, thereby thermally stabilizing DNA-duplexes. Using 2,6-diaminopurine may
increase duplex stability to allow more stringent conditions for annealing, thereby
improving the specificity of duplex formation, suppressing background problems and 
permitting the use of shorter oligomers. 

The synthesis of the triphosphate versions of these modified nucleotides is
disclosed by Hoheisel & Lehrach (1990).

One may also use the non-discriminatory base analogue, or universal base, as 
designed by Nichols et al. (1994). This new analogue,
1-(2'-deoxy-β-D-ribofuranosyl)-3-nitropyrrole (designated M), was generated for use in
oligonucleotide probes and primers for solving the design problems that arise as a result
of the degeneracy of the genetic code, or when only fragmentary peptide sequence data are
available. This analogue maximizes stacking while minimizing hydrogen-bonding 
interactions without sterically disrupting a DNA duplex. 

The M nucleoside analogue was designed to maximize stacking interactions using 
aprotic polar substituents linked to heteroaromatic rings, enhancing intra- and inter-strand 

stacking interactions to lessen the role of hydrogen bonding in base-pairing specificity.
Nichols et al. (1994) favored 3-nitropyrrole 2'-deoxyribonucleoside because of its
structural and electronic resemblance to p-nitroaniline, whose derivatives are among the
smallest known intercalators of double-stranded DNA. 

The dimethoxytrityl-protected phosphoramidite of nucleoside M is also available 
for incorporation into nucleotides used as primers for sequencing and polymerase chain 
reaction (PCR). Nichols et al. (1994) showed that a substantial number of nucleotides
can be replaced by M without loss of primer specificity. 

A unique property of M is its ability to replace long strings of contiguous
nucleosides and still yield functional sequencing primers. Sequences with three, six and
nine M substitutions have all been reported to give readable sequencing ladders, and PCR
with three different M-containing primers all resulted in amplification of the correct
product (Nichols et al., 1994).

The ability of 3-nitropyrrole-containing oligonucleotides to function as primers 
strongly suggests that a duplex structure must form with complementary strands. Optical 
thermal profiles obtained for the oligonucleotide pairs d(5'-C2T5XT5G2-3') and
d(5'-C2A5YA5G2-3') (where X and Y can be A, C, G, T or M) were reported to fit the
normal sigmoidal pattern observed for the DNA double- to single-strand transition. The
Tm values of the oligonucleotides containing X·M base pairs (where X was A, C, G or
T, and Y was M) were reported to all fall within a 3°C range (Nichols et al., 1994).

20 EXAMPLE 3 

Selection and Labeling of Probes 

When an array of subarrays is produced, the sets of probes to be hybridized in
each of the hybridization cycles on each of the subarrays are defined. For example, a set
of 384 probes may be selected from the universal set, and 96 probings may be performed
in each of 4 cycles. Probes selected to be hybridized in one cycle preferably have similar
G+C contents.
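
One way to obtain cycles of probes with similar G+C contents is to sort the selected
probes by G+C fraction and cut the sorted list into groups of 96. The following is a
minimal sketch of that idea only; the function names and the sort-and-slice strategy are
illustrative assumptions, not a procedure stated in this application.

```python
def gc_content(probe: str) -> float:
    """Return the fraction of G and C bases in a probe sequence."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def assign_cycles(probes: list[str], per_cycle: int = 96) -> list[list[str]]:
    """Sort probes by G+C content and split them into consecutive cycles.

    Probes that end up in the same cycle are neighbours in the sorted list,
    so their G+C contents are as close as this simple scheme allows.
    """
    ordered = sorted(probes, key=gc_content)
    return [ordered[i:i + per_cycle] for i in range(0, len(ordered), per_cycle)]
```

With 384 selected probes and the default of 96 probes per cycle, `assign_cycles` returns
4 cycles, matching the example above.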

Selected probes for each cycle are transferred to a 96-well plate and then are
labelled by kinasing or by other labeling procedures if they are not labelled (e.g. with
stable fluorescent dyes) before they are stored.

On the basis of the first round of hybridizations, a new set of probes may be
defined for each of the subarrays for additional cycles. Some of the arrays may not be
used in some of the cycles. For example, if only 8 of 64 patient samples exhibit a
mutation and 8 probes are scored first for each mutation, then all 64 probes may be
scored in one cycle and 32 subarrays are not used. These unused subarrays may then be
treated with hybridization buffer to prevent drying of the filters.
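
The count of idle subarrays in this example follows from simple arithmetic. The sketch
below assumes an array of 96 subarrays (the 96-pin layout described in Example 8) and
one probe per subarray per cycle; the function name is hypothetical.

```python
def unused_subarrays(total_subarrays: int, mutations: int, probes_per_mutation: int) -> int:
    """Return how many subarrays receive no probe in a single cycle."""
    probes_needed = mutations * probes_per_mutation
    if probes_needed > total_subarrays:
        raise ValueError("more probes than subarrays; an additional cycle is needed")
    return total_subarrays - probes_needed


# 8 mutations x 8 probes = 64 probes on 96 subarrays
idle = unused_subarrays(total_subarrays=96, mutations=8, probes_per_mutation=8)
```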

Probes may be retrieved from the storing plates by any convenient approach, such
as a single channel pipetting device, or a robotic station, such as a Beckman Biomek 1000
(Beckman Instruments, Fullerton, California) or a Mega Two robot (Megamation,
Lawrenceville, New Jersey). A robotic station may be integrated with data analysis
programs and probe managing programs. Outputs of these programs may be inputs for
one or more robotic stations.

Probes may be retrieved one by one and added to subarrays covered by
hybridization buffer. It is preferred that retrieved probes be placed in a new plate and
labelled or mixed with hybridization buffer. The preferred method of retrieval is by
accessing stored plates one by one and pipetting (or transferring by metal pins) a
sufficient amount of each selected probe from each plate to specific wells in an
intermediary plate. An array of individually addressable pipettes or pins may be used to
speed up the retrieval process.

EXAMPLE 4 

Preparation of Labeled Probes 

The oligonucleotide probes may be prepared by automated synthesis, which is
routine to those of skill in the art, for example, using an Applied Biosystems system.
Alternatively, probes may be prepared using Genosys Biotechnologies Inc. methods using
stacks of porous Teflon wafers.

Oligonucleotide probes may be labeled with, for example, radioactive labels (35S,
32P, 33P, and preferably 33P) for arrays with 100-200 um or 100-400 um spots;
non-radioactive isotopes (Jacobsen et al., 1990); or fluorophores (Brumbaugh et al.,
1988). All such labeling methods are routine in the art, as exemplified by the relevant
sections in Sambrook et al. (1989) and by further references such as Schubert et al.
(1990), Murakami et al. (1991) and Cate et al. (1991), all articles being specifically
incorporated herein by reference.

In regard to radiolabelling, the common methods are end-labeling using T4
polynucleotide kinase or high specific activity labeling using Klenow or even T7
polymerase. These are described as follows.

Synthetic oligonucleotides are synthesized without a phosphate group at their 5'
termini and are therefore easily labeled by transfer of the γ-32P or γ-33P from [γ-32P]ATP or
[γ-33P]ATP using the enzyme bacteriophage T4 polynucleotide kinase. If the reaction is
carried out efficiently, the specific activity of such probes can be as high as the specific
activity of the [γ-32P]ATP or [γ-33P]ATP itself. The reaction described below is designed
to label 10 pmoles of an oligonucleotide to high specific activity. Labeling of different
amounts of oligonucleotide can easily be achieved by increasing or decreasing the size of
the reaction, keeping the concentrations of all components constant.

A reaction mixture would be created using 1.0 ul of oligonucleotide (10
pmoles/ul); 2.0 ul of 10 x bacteriophage T4 polynucleotide kinase buffer; 5.0 ul of
[γ-32P]ATP or [γ-33P]ATP (sp. act. 5000 Ci/mmole; 10 mCi/ml in aqueous solution) (10
pmoles); and 11.4 ul of water. Eight (8) units (~1 ul) of bacteriophage T4
polynucleotide kinase are added to the reaction mixture, which is incubated for 45 minutes
at 37°C. The reaction is then heated for 10 minutes at 68°C to inactivate the bacteriophage
T4 polynucleotide kinase.
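
Because the concentrations of all components are kept constant, labeling a different
amount of oligonucleotide amounts to multiplying every volume by the same factor. A
minimal sketch of that scaling, using the 10 pmole recipe above as the base; the
dictionary keys and function name are illustrative, and the enzyme volume is the
approximate figure of 1 ul given for 8 units.

```python
# Volumes (ul) of the 10 pmole reaction described above.
BASE_REACTION_UL = {
    "oligonucleotide (10 pmoles/ul)": 1.0,
    "10x T4 polynucleotide kinase buffer": 2.0,
    "[gamma-32P]ATP or [gamma-33P]ATP": 5.0,
    "water": 11.4,
    "T4 polynucleotide kinase (8 units, approx.)": 1.0,
}
BASE_PMOLES = 10.0


def scale_kinase_reaction(pmoles_oligo: float) -> dict[str, float]:
    """Scale every component linearly so that all concentrations stay constant."""
    factor = pmoles_oligo / BASE_PMOLES
    return {name: round(volume * factor, 2) for name, volume in BASE_REACTION_UL.items()}


# Example: a reaction sized for 25 pmoles of oligonucleotide.
scaled = scale_kinase_reaction(25.0)
```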

The efficiency of transfer of 32P or 33P to the oligonucleotide and its specific
activity are then determined. If the specific activity of the probe is acceptable, it is
purified. If the specific activity is too low, an additional 8 units of enzyme is added and
incubated for a further 30 minutes at 37°C before heating the reaction for 10 minutes at
68°C to inactivate the enzyme.

Purification of radiolabeled oligonucleotides can be achieved by, e.g., precipitation
with ethanol; precipitation with cetylpyridinium bromide; by chromatography through
Bio-Gel P-60; or by chromatography on a Sep-Pak C18 column, or by polyacrylamide gel
electrophoresis.

Probes of higher specific activities can be obtained using the Klenow fragment of
E. coli DNA polymerase I to synthesize a strand of DNA complementary to the
synthetic oligonucleotide. A short primer is hybridized to an oligonucleotide template
whose sequence is the complement of the desired radiolabeled probe. The primer is then
extended using the Klenow fragment of E. coli DNA polymerase I to incorporate [α-32P]
dNTPs or [α-33P]dNTPs in a template-directed manner. After the reaction, the template
and product are separated by denaturation followed by electrophoresis through a
polyacrylamide gel under denaturing conditions. With this method, it is possible to
generate oligonucleotide probes that contain several radioactive atoms per molecule of
oligonucleotide.

To use this method, one would mix in a microfuge tube the calculated amounts of
[α-32P]dNTPs or [α-33P]dNTPs necessary to achieve the desired specific activity and
sufficient to allow complete synthesis of all template strands. Then add to the tube the
appropriate amounts of primer and template DNAs, with the primer being in three- to
tenfold molar excess over the template.

0.1 volume of 10 x Klenow buffer would then be added and mixed well. 2-4 units
of the Klenow fragment of E. coli DNA polymerase I would then be added per 5 ul of
reaction volume, mixed and incubated for 2-3 hours at 4°C. If desired, the progress of the
reaction may be monitored by removing small (0.1 ul) aliquots and measuring the
proportion of radioactivity that has become precipitable with 10% trichloroacetic acid
(TCA).
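
The proportional quantities in this procedure (primer excess, buffer volume and enzyme
units) can be worked out for any reaction size. The sketch below is illustrative only:
it assumes that "0.1 volume" refers to the final reaction volume, and the function and
key names are hypothetical.

```python
def klenow_setup(final_volume_ul: float, template_pmol: float) -> dict:
    """Return buffer volume plus (low, high) ranges for enzyme units and primer amount."""
    return {
        # 0.1 volume of 10x Klenow buffer
        "10x_buffer_ul": 0.1 * final_volume_ul,
        # 2-4 units of Klenow fragment per 5 ul of reaction volume
        "klenow_units": (2 * final_volume_ul / 5, 4 * final_volume_ul / 5),
        # primer in three- to tenfold molar excess over the template
        "primer_pmol": (3 * template_pmol, 10 * template_pmol),
    }


# Example: a 20 ul reaction containing 2 pmol of template.
setup = klenow_setup(final_volume_ul=20.0, template_pmol=2.0)
```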

The reaction would be diluted with an equal volume of gel-loading buffer, heated
to 80°C for 3 minutes, and then the entire sample loaded on a denaturing polyacrylamide
gel. Following electrophoresis, the gel is autoradiographed, allowing the probe to be
localized and removed from the gel.

Various methods for fluorescent probe labeling are
also available, e.g., Brumbaugh et al. (1988) describe the synthesis of fluorescently
labeled primers. A deoxyuridine analog with a primary amine "linker arm" of 12 atoms
attached at C-5 is synthesized. Synthesis of the analog consists of derivatizing
2'-deoxyuridine through organometallic intermediates to give 5-(methyl propenoyl)-2'-
deoxyuridine. Reaction with dimethoxytrityl-chloride produces the corresponding
5'-dimethoxytrityl adduct. The methyl ester is hydrolyzed, activated, and reacted with an
appropriately monoacylated alkyl diamine. After purification, the resultant linker arm
nucleosides are converted to nucleoside analogs suitable for chemical oligonucleotide
synthesis.

Oligonucleotides would then be made that include one or two linker arm bases by
using modified phosphoramidite chemistry. To a solution of 50 nmol of the linker arm
oligonucleotide in 25 ul of 500 mM sodium bicarbonate (pH 9.4) is added 20 ul of 300
mM FITC in dimethyl sulfoxide. The mixture is agitated at room temperature for 6 hrs.
The oligonucleotide is separated from free FITC by elution from a 1 x 30 cm Sephadex
G-25 column with 20 mM ammonium acetate (pH 6), combining fractions in the first
UV-absorbing peak.

In general, fluorescent labeling of an oligonucleotide at its 5'-end initially involved
two steps. First, an N-protected aminoalkyl phosphoramidite derivative is added to the 5'-
end of an oligonucleotide during automated nucleic acid synthesis. After removal of all
protecting groups, the NHS ester of an appropriate fluorescent dye is coupled to the 5'-
amino group overnight followed by purification of the labeled oligonucleotide from the
excess of dye using reverse phase HPLC or PAGE.

Schubert et al. (1990) described the synthesis of a phosphoramidite that enables
oligonucleotides labeled with fluorescein to be produced during automated DNA
synthesis.

Murakami et al. also described the preparation of fluorescein-labeled
oligonucleotides.

Cate et al. (1991) describe the use of oligonucleotide probes directly conjugated to
alkaline phosphatase in combination with a direct chemiluminescent substrate (AMPPD) to
allow probe detection.

Labeled probes could readily be purchased from a variety of commercial sources,
including GENSET, rather than synthesized.

Other labels include ligands which can serve as specific binding members to a
labeled antibody, chemiluminescers, enzymes, antibodies which can serve as a specific
binding pair member for a labeled ligand, and the like. A wide variety of labels have
been employed in immunoassays which can readily be employed. Still other labels
include antigens, groups with specific reactivity, and electrochemically detectable
moieties.

In general, labeling of nucleic acids with electrophore mass labels ("EML") is
described, for example, in Xu et al., J. Chromatography 764:95-102 (1997).
Electrophores are compounds that can be detected with high sensitivity by electron
capture mass spectrometry (EC-MS). EMLs can be attached to a probe using chemistry
that is well known in the art for reversibly modifying a nucleotide (e.g., well known
nucleotide synthesis chemistry teaches a variety of methods for attaching molecules to
nucleotides as protecting groups). EMLs are detected using a variety of well known
electron capture mass spectrometry devices (e.g., devices sold by Finnigan Corporation).
Further, techniques that may be used in the detection of EMLs include, for example, fast
atom bombardment mass spectrometry (see, e.g., Koster et al., Biomedical Environ.
Mass Spec. 14:111-116 (1987)); plasma desorption mass spectrometry;
electrospray/ionspray (see, e.g., Fenn et al., J. Phys. Chem. 88:4451-59 (1984), PCT
Appln. No. WO 90/14148, Smith et al., Anal. Chem. 62:882-89 (1990)); and
matrix-assisted laser desorption/ionization (Hillenkamp et al., "Matrix Assisted UV-Laser
Desorption/Ionization: A New Approach to Mass Spectrometry of Large Biomolecules,"
Biological Mass Spectrometry (Burlingame and McCloskey, eds.), Elsevier Science
Publishers, Amsterdam, pp. 49-60, 1990; Huth-Fehre et al., "Matrix Assisted Laser
Desorption Mass Spectrometry of Oligodeoxythymidylic Acids," Rapid Communications
in Mass Spectrometry, 6:209-13 (1992)).

In preferred embodiments, the EMLs are attached to a probe by a covalent bond
that is light sensitive. The EML is released from the probe after hybridization with a
target nucleic acid by a laser or other light source emitting the desired wavelength of
light. The EML is then fed into a GC-MS (gas chromatograph-mass spectrometer) or
other appropriate device, and identified by its mass.

EXAMPLE 5

Preparation of Sequencing Chips and Arrays 

A basic example is using 6-mers attached to 50 micron surfaces to give a chip
with dimensions of 3 x 3 mm which can be combined to give an array of 20 x 20 cm.
Another example is using 9-mer oligonucleotides attached to 10 x 10 microns surface to
create a 9-mer chip, with dimensions of 5 x 5 mm. 4000 units of such chips may be used
to create a 30 x 30 cm array. In one such arrangement, 4,000 to 16,000 oligochips are
arranged into a square array. A plate, or collection of tubes, as also depicted, may be
packaged with the array as part of the sequencing kit.

The arrays may be separated physically from each other or by hydrophobic
surfaces. One possible way to utilize the hydrophobic strip separation is to use
technology such as the Iso-Grid Microbiology System produced by QA Laboratories,
Toronto, Canada.



Hydrophobic grid membrane filters (HGMF) have been in use in analytical food
microbiology for about a decade where they exhibit unique attractions of extended
numerical range and automated counting of colonies. One commercially-available grid is
ISO-GRID™ from QA Laboratories Ltd. (Toronto, Canada) which consists of a square
(60 x 60 cm) of polysulfone polymer (Gelman Tuffryn HT-450, 0.45 um pore size) on
which is printed a black hydrophobic ink grid consisting of 1600 (40 x 40) square cells.
HGMF have previously been inoculated with bacterial suspensions by vacuum filtration
and incubated on the differential or selective media of choice.

Because the microbial growth is confined to grid cells of known position and size
on the membrane, the HGMF functions more like an MPN apparatus than a conventional
plate or membrane filter. Peterkin et al. (1987) reported that these HGMFs can be used
to propagate and store genomic libraries when used with a HGMF replicator. One such
instrument replicates growth from each of the 1600 cells of the ISO-GRID and enables
many copies of the master HGMF to be made (Peterkin et al., 1987).

Sharpe et al. (1989) also used ISO-GRID HGMF from QA Laboratories and an
automated HGMF counter (MI-100 Interpreter) and RP-100 Replicator. They reported a
technique for maintaining and screening many microbial cultures.

Peterkin and colleagues later described a method for screening DNA probes using
the hydrophobic grid-membrane filter (Peterkin et al., 1989). These authors reported
methods for effective colony hybridization directly on HGMFs. Previously, poor results
had been obtained due to the low DNA binding capacity of the epoxysulfone polymer on
which the HGMFs are printed. However, Peterkin et al. (1989) reported that the binding
of DNA to the surface of the membrane was improved by treating the replicated and
incubated HGMF with polyethyleneimine, a polycation, prior to contact with DNA.
Although this early work uses cellular DNA attachment, and has a different objective to
the present invention, the methodology described may be readily adapted for Format 3
SBH.

In order to identify useful sequences rapidly, Peterkin et al. (1989) used
radiolabeled plasmid DNA from various clones and tested its specificity against the DNA
on the prepared HGMFs. In this way, DNA from recombinant plasmids was rapidly
screened by colony hybridization against 100 organisms on HGMF replicates which can
be easily and reproducibly prepared.

Manipulation with small (2-3 mm) chips, and parallel execution of thousands of
the reactions. The solution of the invention is to keep the chips and the probes in the
corresponding arrays. In one example, chips containing 250,000 9-mers are synthesized
on a silicon wafer in the form of 8 x 8 mm plates (15 um/oligonucleotide, Pease et al.,
1994) arrayed in 8 x 12 format (96 chips) with a 1 mm groove in between. Probes are
added either by multichannel pipette or pin array, one probe on one chip. To score all
4000 6-mers, 42 chip arrays have to be used, either using different ones, or by reusing
one set of chip arrays several times.
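
The number of chip arrays needed follows from dividing the number of probes by the 96
chips in one 8 x 12 array and rounding up, since one probe is applied to one chip. A
short sketch of this arithmetic (the function name is illustrative); it also shows the
result for the exact count of 4^6 = 4096 possible 6-mers, which is an added comparison
and not a figure from the text.

```python
import math


def chip_arrays_needed(n_probes: int, chips_per_array: int = 96) -> int:
    """Number of chip arrays required when each chip receives exactly one probe."""
    return math.ceil(n_probes / chips_per_array)


arrays_for_round_figure = chip_arrays_needed(4000)   # round figure used in the text
arrays_for_all_6mers = chip_arrays_needed(4 ** 6)    # all 4096 possible 6-mers
```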

In the above case, using the earlier nomenclature of the application, F=9; P=6;
and F + P = 15. Chips may have probes of formula BxNn, where x is a number of
specified bases B; and n is a number of non-specified bases, so that x = 4 to 10 and n =
1 to 4. To achieve more efficient hybridization, and to avoid potential influence of any
support oligonucleotides, the specified bases can be surrounded by unspecified bases, thus
represented by a formula such as (N)nBx(N)m.
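
A probe of formula (N)nBx(N)m stands for a pool of 4^(n+m) individual sequences that
share the same specified core. The sketch below enumerates such a pool for illustration;
the function name and the example core are assumptions, not taken from the application.

```python
from itertools import product

BASES = "ACGT"


def expand_probe(specified: str, n_left: int = 0, n_right: int = 0) -> list[str]:
    """List every sequence in the pool (N)n_left + specified + (N)n_right."""
    pool = []
    for flank in product(BASES, repeat=n_left + n_right):
        left = "".join(flank[:n_left])
        right = "".join(flank[n_left:])
        pool.append(left + specified + right)
    return pool


# Example: a 6-base specified core flanked by one unspecified base on each side.
pool = expand_probe("ACGTAC", n_left=1, n_right=1)
```

Here `pool` holds 16 octamers, each carrying the core at positions 2 to 7.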

In another embodiment of the chips, the substrate which supports the array of
oligonucleotide probes is partitioned into sections so that each probe in the array is
separated from adjacent probes by a physical barrier which may be, for example, a
hydrophobic material. In a preferred embodiment, the physical barrier has a width of
from 300 um to 30 um, and the distance between the center of each physical barrier to
the center of adjacent physical barriers is at least 325 um.

In a preferred embodiment, a hydrophobic material is deposited onto the substrate
to form barriers of the desired width using an ink-jet head, coupled to an appropriate
robotic system. For example, a microdrop dosing head, that has been adapted to apply a
suspension or solution of a desired hydrophobic material (e.g., an oil based material that
forms a barrier after the solvent has evaporated), may be coupled with an Anorad gantry
system and fitted to an appropriate housing and dispensing system so that a grid of the
hydrophobic material may be applied onto the desired substrate forming a plurality of
wells on the substrate. After the grid of hydrophobic material has been formed, different
probes are spotted onto each well (or mixtures of probes may be applied to each well)
using a robotic system similar to that used to form the grid, but that has been adapted to
apply solutions or suspensions of probes. In one embodiment, the same robotic system is
used to apply the hydrophobic grid and the probes. In this embodiment, the dispensing
system is flushed after the hydrophobic grid is applied and then primed for delivery of
probe.

EXAMPLE 6

Preparation of Support Bound Oligonucleotides

Oligonucleotides, i.e., small nucleic acid segments, may be readily prepared by,
for example, directly synthesizing the oligonucleotide by chemical means, as is commonly
practiced using an automated oligonucleotide synthesizer.

In general, oligonucleotides may be bound to a support through appropriate
reactive groups. Such groups are well known in the art and include, for example, amino
(-NH2); hydroxyl (-OH); or carboxyl (-CO2H) groups. Support bound oligonucleotides
may be prepared by any of the methods known to those of skill in the art using any
suitable support such as glass, polystyrene or Teflon. One strategy is to precisely spot
oligonucleotides synthesized by standard synthesizers. Immobilization can be achieved by
many methods, including, for example, using passive adsorption (Inouye & Hondo,
1990); using UV light (Nagata et al., 1985; Dahlen et al., 1987; Morrissey & Collins,
1989); or by covalent binding of base modified DNA (Keller et al., 1988; 1989); or by
formation of amide groups between the probe and the support (Wall et al., 1995; Chebab
et al., 1992; and Zhang et al., 1991); all references being specifically incorporated
herein.

Another strategy that may be employed is the use of the strong biotin-streptavidin
interaction as a linker. For example, Broude et al. (1994) describe the use of
biotinylated probes, although these are duplex probes, that are immobilized on
streptavidin-coated magnetic beads. Streptavidin-coated beads may be purchased from
Dynal, Oslo. Of course, this same linking chemistry is applicable to coating any surface
with streptavidin. Biotinylated probes may be purchased from various sources, such as,
e.g., Operon Technologies (Alameda, CA).

Nunc Laboratories (Naperville, IL) is also selling suitable material that could be
used. Nunc Laboratories have developed a method by which DNA can be covalently
bound to the microwell surface termed CovaLink NH. CovaLink NH is a polystyrene
surface grafted with secondary amino groups (>NH) that serve as bridge-heads for
further covalent coupling. CovaLink Modules may be purchased from Nunc
Laboratories. DNA molecules may be bound to CovaLink exclusively at the 5'-end by a
phosphoramidate bond, allowing immobilization of more than 1 pmol of DNA
(Rasmussen et al., 1991).

The use of CovaLink NH strips for covalent binding of DNA molecules at the
5'-end has been described (Rasmussen et al., 1991). In this technology, a
phosphoramidate bond is employed (Chu et al., 1983). This is beneficial as
immobilization using only a single covalent bond is preferred. The phosphoramidate
bond joins the DNA to the CovaLink NH secondary amino groups that are positioned at
the end of spacer arms covalently grafted onto the polystyrene surface through a 2 nm
long spacer arm. To link an oligonucleotide to CovaLink NH via a phosphoramidate
bond, the oligonucleotide terminus must have a 5'-end phosphate group. It is, perhaps,
even possible for biotin to be covalently bound to CovaLink and then streptavidin used to
bind the probes.

More specifically, the linkage method includes dissolving DNA in water (7.5
ng/ul) and denaturing for 10 min. at 95°C and cooling on ice for 10 min. Ice-cold 0.1 M
1-methylimidazole, pH 7.0 (1-MeIm7), is then added to a final concentration of 10 mM
1-MeIm7. The ssDNA solution is then dispensed into CovaLink NH strips (75 ul/well)
standing on ice.

Carbodiimide 0.2 M 1-ethyl-3-(3-dimethylaminopropyl)-carbodiimide (EDC),
dissolved in 10 mM 1-MeIm7, is made fresh and 25 ul added per well. The strips are
incubated for 5 hours at 50°C. After incubation the strips are washed using, e.g.,
Nunc-Immuno Wash: first the wells are washed 3 times, then they are soaked with
washing solution for 5 min., and finally they are washed 3 times (wherein the washing
solution is 0.4 N NaOH, 0.25% SDS heated to 50°C).
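
The per-well quantities implied by this protocol can be checked with a short dilution
calculation. The sketch below assumes that the 7.5 ng/ul figure applies to the solution
as dispensed into the wells and that the two additions (75 ul of DNA solution and 25 ul
of EDC solution) are simply additive; the variable and function names are illustrative.

```python
def dilute(stock_conc: float, stock_ul: float, final_ul: float) -> float:
    """Concentration after bringing stock_ul of a stock to a total of final_ul."""
    return stock_conc * stock_ul / final_ul


DNA_NG_PER_UL = 7.5      # DNA concentration of the dispensed solution (assumed)
DNA_UL_PER_WELL = 75.0   # volume of ssDNA solution per well
EDC_STOCK_MM = 200.0     # 0.2 M EDC expressed in mM
EDC_UL_PER_WELL = 25.0   # volume of EDC solution per well

well_volume_ul = DNA_UL_PER_WELL + EDC_UL_PER_WELL
dna_ng_per_well = DNA_NG_PER_UL * DNA_UL_PER_WELL
edc_final_mm = dilute(EDC_STOCK_MM, EDC_UL_PER_WELL, well_volume_ul)
```

Under these assumptions each well holds 100 ul containing 562.5 ng of DNA and 50 mM EDC,
while 1-MeIm7 stays at 10 mM because both solutions contain it at that concentration.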

It is contemplated that a further suitable method for use with the present invention
is that described in PCT Patent Application WO 90/03382 (Southern & Maskos),
incorporated herein by reference. This method of preparing an oligonucleotide bound to a
support involves attaching a nucleoside 3'-reagent through the phosphate group by a
covalent phosphodiester link to aliphatic hydroxyl groups carried by the support. The
oligonucleotide is then synthesized on the supported nucleoside and protecting groups
removed from the synthetic oligonucleotide chain under standard conditions that do not
cleave the oligonucleotide from the support. Suitable reagents include nucleoside
phosphoramidite and nucleoside hydrogen phosphonate.

An on-chip strategy for the preparation of DNA probe arrays may be employed.
For example, addressable laser-activated
photodeprotection may be employed in the chemical synthesis of oligonucleotides directly
on a glass surface, as described by Fodor et al. (1991), incorporated herein by reference.
Probes may also be immobilized on nylon supports as described by Van Ness et al.
(1991); or linked to Teflon using the method of Duncan & Cavalier (1988); all references
being specifically incorporated herein.

To link an oligonucleotide to a nylon support, as described by Van Ness et al.
(1991), requires activation of the nylon surface via alkylation and selective activation of
the 5'-amine of oligonucleotides with cyanuric chloride.

One particular way to prepare support bound oligonucleotides is to utilize the
light-generated synthesis described by Pease et al. (1994, incorporated herein by
reference). These authors used current photolithographic techniques to generate arrays of
immobilized oligonucleotide probes (DNA chips). These methods, in which light is used
to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays,
utilize photolabile 5'-protected N-acyl-deoxynucleoside phosphoramidites, surface linker
chemistry and versatile combinatorial synthesis strategies. A matrix of 256 spatially
defined oligonucleotide probes may be generated in this manner and then used in the
advantageous Format 3 sequencing, as described herein.

Of course, one could easily purchase a DNA chip, such as one of the
light-activated chips described above, from a commercial source. In this regard, one may
contact Affymetrix of Santa Clara, CA 95051, and Beckman.

In a preferred embodiment, the probes of the invention include an informational
portion (the portion which hybridizes to the target nucleic acid and gives sequence
information), a reactive group to be attached to the substrate (solid support), and
randomized positions, i.e., any of the four bases may be found at these positions. A
preferred probe has the sequence 5'-(T)6-(N)3-(B)5, where T = thymine (binds to solid
support), N = A, C, G, or T (randomized positions), and B = the five information
positions of the probe (informational portion). In a most preferred embodiment, the
probe may be bound to the support and a spacer moiety is found at the end of the probe
or internal to the probe and 5' of (N)3. The spacers may be comprised of atoms capable
of forming at least two covalent bonds such as carbon, silicon, oxygen, sulfur,
phosphorus, and the like, or may be comprised of molecules capable of forming at least
two covalent bonds such as sugar-phosphate groups, amino acids, peptides, nucleosides,
nucleotides, sugars, carbohydrates, aromatic rings, hydrocarbon rings, linear and
branched hydrocarbons, and the like.

EXAMPLE 7 

Preparation of Nucleic Acid Fragments 

The nucleic acids to be sequenced may be obtained from any appropriate source,
such as cDNAs, genomic DNA, chromosomal DNA, microdissected chromosome bands,
cosmid or YAC inserts, and RNA, including mRNA without any amplification steps. For
example, Sambrook et al. (1989) describe three protocols for the isolation of high
molecular weight DNA from mammalian cells (p. 9.14-9.23).

Target nucleic acid fragments may be prepared as clones in M13, plasmid or
lambda vectors and/or prepared directly from genomic DNA or cDNA by PCR or other
amplification methods. Samples may be prepared or dispensed in multiwell plates. About
100-1000 ng of DNA samples may be prepared in 2-500 ul of final volume. Target
nucleic acids prepared by PCR may be directly applied to a substrate for Format 1 SBH
without purification. Once the target nucleic acids are fixed to the substrate, the substrate
may be washed or directly annealed with probes.

The nucleic acids would then be fragmented by any of the methods known to those
of skill in the art including, for example, using restriction enzymes as described at
9.24-9.28 of Sambrook et al. (1989), shearing by ultrasound and NaOH treatment.

Low pressure shearing is also appropriate, as described by Schriefer et al. (1990,
incorporated herein by reference). In this method, DNA samples are passed through a
small French pressure cell at a variety of low to intermediate pressures. A lever device
allows controlled application of low to intermediate pressures to the cell. The results of
these studies indicate that low-pressure shearing is a useful alternative to sonic and
enzymatic DNA fragmentation methods.

One particularly suitable way for fragmenting DNA is contemplated to be that
using the two base recognition endonuclease, CviJI, described by Fitzgerald et al. (1992).
These authors described an approach for the rapid fragmentation and fractionation of
DNA into particular sizes that they contemplated to be suitable for shotgun cloning and
sequencing. The present inventor envisions that this will also be particularly useful for
generating random, but relatively small, fragments of DNA for use in the present
sequencing technology.

The restriction endonuclease CviJI normally cleaves the recognition sequence
PuGCPy between the G and C to leave blunt ends. Atypical reaction conditions, which
alter the specificity of this enzyme (CviJI**), yield a quasi-random distribution of DNA
fragments from the small molecule pUC19 (2688 base pairs). Fitzgerald et al. (1992)
quantitatively evaluated the randomness of this fragmentation strategy, using a CviJI**
digest of pUC19 that was size fractionated by a rapid gel filtration method and directly
ligated, without end repair, to a lacZ minus M13 cloning vector. Sequence analysis of
76 clones showed that CviJI** restricts PyGCPy and PuGCPu, in addition to PuGCPy
sites, and that new sequence data is accumulated at a rate consistent with random
fragmentation.
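
The two specificities described above can be illustrated with an in silico digest. The
sketch below is a simplified model and not a method of the application: it scans a
single strand for the stated recognition sequences (Pu = A or G, Py = C or T), places a
blunt cut between the G and the C, and assumes complete digestion at every site; the
function names and the test sequence are hypothetical.

```python
import re

# Lookahead patterns so that overlapping recognition sites are all reported.
NORMAL_SITES = re.compile(r"(?=[AG]GC[CT])")                            # PuGCPy
RELAXED_SITES = re.compile(r"(?=[AG]GC[CT]|[CT]GC[CT]|[AG]GC[AG])")     # CviJI**


def cut_positions(seq: str, relaxed: bool = False) -> list[int]:
    """Return indices at which the sequence is cut (between G and C of each site)."""
    pattern = RELAXED_SITES if relaxed else NORMAL_SITES
    return [match.start() + 2 for match in pattern.finditer(seq.upper())]


def digest(seq: str, relaxed: bool = False) -> list[str]:
    """Split the sequence at every cut position, assuming complete digestion."""
    bounds = [0] + cut_positions(seq, relaxed) + [len(seq)]
    return [seq[start:end] for start, end in zip(bounds, bounds[1:])]


example = "TTAGCTTTCGCTAA"
normal_fragments = digest(example)                  # cuts only at AGCT (PuGCPy)
relaxed_fragments = digest(example, relaxed=True)   # also cuts at CGCT (PyGCPy)
```

Because PuGCPy is its own reverse complement and PyGCPy is the reverse complement of
PuGCPu, scanning one strand for all three patterns covers sites on both strands.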

As reported in the literature, advantages of this approach compared to sonication
and agarose gel fractionation include: smaller amounts of DNA are required (0.2-0.5 ug
instead of 2-5 ug); and fewer steps are involved (no preligation, end repair, chemical
extraction, or agarose gel electrophoresis and elution are needed). These advantages are
also proposed to be of use when preparing DNA for sequencing by Format 3.

In a preferred embodiment, the "fragments" of the nucleic acid sample are
prepared so that they cannot be ligated to each other. Such a pool of fragments may be
obtained by treating the fragmented nucleic acids obtained by enzyme digestion or
physical shearing, with a phosphatase (e.g., calf intestinal phosphatase). Alternatively,
nonligatable fragments of the sample nucleic acid may be obtained by using random
primers (e.g., N5-N9, where N = A, G, T, or C), which have no phosphate at their 5'-
ends, in a Sanger-dideoxy sequencing reaction with the sample nucleic acid. This will
produce fragments of DNA that have a complementary sequence to the target nucleic acid
and that are terminated in a dideoxy residue and which cannot be ligated to other
fragments.

Irrespective of the manner in which the nucleic acid fragments are obtained or
prepared, it is important to denature the DNA to give single stranded pieces available for
hybridization. This is achieved by incubating the DNA solution for 2-5 minutes at
80-90°C. The solution is then cooled quickly to 2°C to prevent renaturation of the DNA
fragments before they are contacted with the chip.

EXAMPLE 8

Preparation of DNA Arrays 

Arrays may be prepared by spotting DNA samples on a support such as a nylon
membrane. Spotting may be performed by using arrays of metal pins (the positions of
which correspond to an array of wells in a microtiter plate) for repeated transfer of
about 20 nl of a DNA solution to a nylon membrane. By offset printing, a density of dots
higher than the density of the wells is achieved. One to 25 dots may be accommodated in
1 mm², depending on the type of label used. By avoiding spotting in some preselected
number of rows and columns, separate subsets (subarrays) may be formed. Samples in
one subarray may be the same genomic segment of DNA (or the same gene) from
different individuals, or may be different, overlapped genomic clones. Each of the
subarrays may represent replica spotting of the same samples. In one example, a selected
gene segment may be amplified from 64 patients. For each patient, the amplified gene
segment may be in one 96-well plate (all 96 wells containing the same sample). A plate
for each of the 64 patients is prepared. By using a 96-pin device, all samples may be
spotted on one 8 x 12 cm membrane. Subarrays may contain 64 samples, one from each
patient. Where the 96 subarrays are identical, the dot span may be 1 mm² and there may
be a 1 mm space between subarrays.
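
The layout in this example can be checked with a short sketch. It assumes a 9 mm pin pitch (the standard spacing of a 96-well plate, which the text does not state) and a 1 mm dot pitch; the function and constant names are illustrative only.

```python
# Sketch of offset printing with a 96-pin tool (illustrative values only).
PIN_PITCH_MM = 9.0   # assumption: standard 96-well plate spacing
DOT_PITCH_MM = 1.0   # from the text: each dot spans about 1 mm
ROWS, COLS = 8, 12   # pins on a 96-pin device
SIDE = 8             # 8 x 8 = 64 dots (one per patient) in each subarray


def dot_position(pin_row, pin_col, patient):
    """Return (x, y) in mm of the dot printed by one pin for one patient plate."""
    off_row, off_col = divmod(patient, SIDE)  # offset of this plate's print pass
    x = pin_col * PIN_PITCH_MM + off_col * DOT_PITCH_MM
    y = pin_row * PIN_PITCH_MM + off_row * DOT_PITCH_MM
    return x, y


positions = {
    (r, c, p): dot_position(r, c, p)
    for r in range(ROWS)
    for c in range(COLS)
    for p in range(SIDE * SIDE)
}
total_dots = len(positions)
width_mm = max(x for x, _ in positions.values()) + DOT_PITCH_MM
height_mm = max(y for _, y in positions.values()) + DOT_PITCH_MM
gap_mm = PIN_PITCH_MM - SIDE * DOT_PITCH_MM  # free space left between subarrays
```

Under these assumptions the 96 subarrays of 64 dots occupy roughly 107 x 71 mm with a 1 mm gap between subarrays, which fits the 8 x 12 cm membrane described above.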

Another approach is to use membranes or plates (available from NUNC,
Naperville, Illinois) which may be partitioned by physical spacers, e.g., a plastic grid
molded over the membrane, the grid being similar to the sort of membrane applied to the
bottom of multiwell plates, or hydrophobic strips. A fixed physical spacer is not preferred
for imaging by exposure to flat phosphor-storage screens or x-ray films.

EXAMPLE 9 
Hybridization and Scoring Process

Labeled probes may be mixed with hybridization buffer and pipetted, preferably
by multichannel pipettes, to the subarrays. To prevent mixing of the probes between
subarrays (if there are no hydrophobic strips or physical barriers imprinted in the
membrane), a corresponding plastic, metal or ceramic grid may be firmly pressed to the
membrane. Also, the volume of the buffer may be reduced to about 1 µl or less per mm².
The concentration of the probes and hybridization conditions used may be as described
previously except that the washing buffer may be quickly poured over the array of
subarrays to allow fast dilution of probes and thus prevent significant cross-hybridization.
For the same reason, a minimal concentration of the probes may be used and
hybridization time extended to the maximal practical level. For DNA detection and
sequencing, knowledge of a "normal" sequence allows the use of the continuous stacking
interaction phenomenon to increase the signal. In addition to the labelled probe, additional
unlabelled probes which hybridize back to back with a labelled one may be added in the
hybridization reaction. The amount of the hybrid may be increased several times. The
probes may be connected by ligation. This approach may be important for resolving
DNA regions forming "compressions".

In the case of radiolabeled probes, images of the filters may be obtained,
preferably by phosphor-storage technology. Fluorescent labels may be scored by CCD
cameras, confocal microscopy or otherwise. In order to properly scale and integrate data
from different hybridization experiments, raw signals are normalized based on the amount
of target in each dot. Differences in the amount of target DNA per dot may be corrected
for by dividing signals of each probe by an average signal for all probes scored on one
dot. The normalized signals may be scaled, usually from 1-100, to compare data from
different experiments. Also, in each subarray, several control DNAs may be used to
determine an average background signal in those samples which do not contain a full
match target. For samples obtained from diploid (polyploid) sources, homozygotic controls
may be used to allow recognition of heterozygotes in the samples.
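
The normalization step described above may be sketched as follows. The text does not say how the 1-100 scaling is done, so the linear min-max rescaling here is an assumption, and the probe names and counts are hypothetical.

```python
def normalize_dot(signals):
    """Divide each probe signal on one dot by the mean signal of all probes on that dot."""
    mean = sum(signals.values()) / len(signals)
    return {probe: s / mean for probe, s in signals.items()}


def scale_1_to_100(values):
    """Linearly rescale so the minimum maps to 1 and the maximum to 100 (assumed scheme)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 1.0 for k in values}
    return {k: 1.0 + 99.0 * (v - lo) / (hi - lo) for k, v in values.items()}


# Hypothetical raw counts: dot B carries twice as much target DNA as dot A.
dot_a = {"probe1": 100.0, "probe2": 300.0, "probe3": 200.0}
dot_b = {"probe1": 200.0, "probe2": 600.0, "probe3": 400.0}

norm_a = normalize_dot(dot_a)
norm_b = normalize_dot(dot_b)
scaled_a = scale_1_to_100(norm_a)
```

After normalization the two dots give identical profiles, which is the purpose of dividing by the per-dot average: the twofold difference in target amount no longer affects the comparison.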

EXAMPLE 10 

Hybridization With Oligonucleotides 

Oligonucleotides were either purchased from Genosys Inc., Houston, Texas or
made on an Applied Biosystems 381A DNA synthesizer. Most of the probes used were
not purified by HPLC or gel electrophoresis. For example, probes were designed to have
both a single perfectly complementary target in interferon, an M13 clone containing a 921
bp Eco RI-Bgl II human β1-interferon fragment [Ohno and Taniguchi, Proc. Natl.
Acad. Sci. 74: 4370-4374 (1981)], and at least one target with an end base mismatch in
the M13 vector itself.

End labeling of oligonucleotides was performed as described [Maniatis et al.,
Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring
Harbor, New York (1982)] in 10 µl containing T4 polynucleotide kinase (5 units,
Amersham), γ-$^{32}$P-ATP (3.3 pmol, 10 µCi, Amersham, 3000 Ci/mmol) and oligonucleotide
(4 pmol, 10 ng). Specific activities of the probes were 2.5-5 × 10^9 cpm/nmol.

Single stranded DNA (2 to 4 µl in 0.5 M NaOH, 1.5 M NaCl) was spotted on a
Gene Screen membrane wetted with the same solution. The filters were neutralized in 0.05
M Na$_2$HPO$_4$ pH 6.5, baked in an oven at 80°C for 60 min, and UV irradiated for 1 min.
Then, the filters were incubated in hybridization solution (0.5 M Na$_2$HPO$_4$ pH 7.2, 7%
sodium lauroyl sarcosine) for 5 min at room temperature and placed on the surface of a
plastic Petri dish. A drop of hybridization solution (10 µl, 0.5 M Na$_2$HPO$_4$ pH 7.2, 7%
sodium lauroyl sarcosine) with a $^{32}$P end-labeled oligomer probe at 4 nM concentration
was placed over 1-6 dots per filter, overlaid with a square piece of polyethylene
(approximately 1 × 1 cm), and incubated in a moist chamber at the indicated
temperatures for 3 hr. Hybridization was stopped by placing the filter in 6X SSC
washing solution for 3 × 5 minutes at 0°C to remove unhybridized probe. The filter was
either dried, or further washed for the indicated times and temperatures, and
autoradiographed. For discrimination measurements, the dots were excised from the
dried filters after autoradiography [a phosphoimager (Molecular Dynamics, Sunnyvale,
California) may be used], placed in liquid scintillation cocktail and counted. The
uncorrected ratio of cpms for IF and M13 dots is given as D.

The conditions reported herein allow hybridization with very short oligonucleotides
but ensure discrimination between matched and mismatched oligonucleotides that are
complementary to and therefore bind to a target nucleic acid. Factors which influence the
efficient detection of hybridization of specific short sequences based on the degree of
discrimination (D) between a perfectly complementary target and an imperfectly
complementary target with a single mismatch in the hybrid are defined. In experimental
tests, dot blot hybridization of twenty-eight probes that were 6 to 8 nucleotides in length
to two M13 clones or to model oligonucleotides bound to membrane filters was
accomplished. The principles guiding the experimental procedures are given below.

Oligonucleotide hybridization to filter bound target nucleic acids only a few
nucleotides longer than the probe in conditions of probe excess is a pseudo-first order
reaction with respect to target concentration. This reaction is defined by:

$$S_t / S_0 = e^{-k_h [OP] t}$$

wherein $S_t$ and $S_0$ are target sequence concentrations at times $t$ and $t_0$, respectively,
$[OP]$ is the probe concentration and $t$ is time. The rate constant for hybrid formation, $k_h$,
increases only slightly in the 0°C to 30°C range [Porschke and Eigen, J. Mol. Biol. 62:
361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. Hybrid melting is a first order
reaction with respect to hybrid concentration (here replaced by mass due to filter bound
state) as shown in:

$$H_t / H_0 = e^{-k_m t}$$

In this equation, $H_t$ and $H_0$ are hybrid concentrations at times $t$ and $t_0$,
respectively; $k_m$ is a rate constant for hybrid melting which is dependent on temperature
and salt concentration [Ikuta et al., Nucl. Acids Res. 15: 797 (1987); Porschke and Eigen,
J. Mol. Biol. 62: 361 (1971); Craig et al., J. Mol. Biol. 62: 383 (1971)]. During
hybridization, which is a strand association process, the back reaction (melting, or strand
dissociation) takes place as well. Thus, the amount of hybrid formed in time is the
result of forward and back reactions. The equilibrium may be moved towards hybrid
formation by increasing probe concentration and/or decreasing temperature. However,
during washing cycles in large volumes of buffer, the melting reaction is dominant and
the back reaction, hybridization, is insignificant, since the probe is absent. This analysis
indicates that workable Short Oligonucleotide Hybridization (SOH) conditions can be varied
for probe concentration or temperature.
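
The two rate laws above can be evaluated numerically. The sketch below uses hypothetical rate constants (the source gives none) and ignores the back reaction during hybridization, so it only illustrates the direction of the effects.

```python
import math


def unhybridized_fraction(k_h, probe_conc, t):
    """S_t / S_0 = exp(-k_h * [OP] * t): pseudo-first-order hybrid formation."""
    return math.exp(-k_h * probe_conc * t)


def remaining_hybrid_fraction(k_m, t):
    """H_t / H_0 = exp(-k_m * t): first-order melting during washing."""
    return math.exp(-k_m * t)


# Hypothetical rate constants, chosen only to give readable numbers.
K_H = 1.0e4   # 1/(M*s), assumed
K_M = 1.0e-3  # 1/s, assumed

three_hours = 3 * 3600
low = 1.0 - unhybridized_fraction(K_H, 4e-9, three_hours)    # 4 nM probe
high = 1.0 - unhybridized_fraction(K_H, 40e-9, three_hours)  # 40 nM probe
half_life = math.log(2) / K_M  # wash time at which half of the hybrid has melted
```

With these assumed values, a tenfold higher probe concentration converts a larger fraction of target into hybrid within the same 3 hours, and the wash half-life depends only on $k_m$, which is why washing can be tuned separately from hybridization.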

D or discrimination is defined in equation four:

$$D = H_p(t_w) / H_i(t_w)$$

$H_p(t_w)$ and $H_i(t_w)$ are the amounts of hybrids remaining after a washing time, $t_w$, for
the identical amounts of perfectly and imperfectly complementary duplex, respectively.
For a given temperature, the discrimination D changes with the length of washing time
and reaches the maximal value when $H_i = B$, which is equation five.






The background, $B$, represents the lowest hybridization signal detectable in the
system. Since any further decrease of $H_i$ may not be examined, D increases upon
continued washing only until $H_i$ reaches $B$. Washing past $t_w$ just decreases $H_p$
relative to $B$, and is seen as a decrease in D. The optimal washing time, $t_w$, for
imperfect hybrids, from equation three and equation five is:

$$t_w = -\ln(B / H_i(t_0)) / k_{m,i}$$

Since $H_p$ is being washed for the same $t_w$, combining equations, one obtains the
optimal discrimination function:

$$D = e^{\ln(B / H_i(t_0)) \, k_{m,p} / k_{m,i}} \times H_p(t_0) / B$$

The change of D as a function of T is important because of the choice of an
optimal washing temperature. It is obtained by substituting the Arrhenius equation, which
is:

$$k_m = A e^{-E_a / RT}$$

into the previous equation to form the final equation:

$$D = H_p(t_0) / B \times (B / H_i(t_0))^{(A_p / A_i) \, e^{(E_{a,i} - E_{a,p}) / RT}}$$

wherein $B$ is less than $H_i(t_0)$.

Since the activation energy for perfect hybrids, $E_{a,p}$, and the activation energy for
imperfect hybrids, $E_{a,i}$, can be either equal, or $E_{a,i}$ less than $E_{a,p}$, D is temperature
independent, or decreases with increasing temperature, respectively. This result implies
that the search for stringent temperature conditions for good discrimination in SOH is
unjustified. By washing at lower temperatures, one obtains equal or better discrimination,
but the time of washing exponentially increases with the decrease of temperature.
Discrimination more strongly decreases with T, if $H_i(t_0)$ increases relative to $H_p(t_0)$.
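
The temperature dependence derived above can be illustrated numerically. All parameter values in this sketch (pre-exponential factor, activation energies, starting amounts, background) are hypothetical and are not taken from the source; only the functional form follows the equations.

```python
import math

R = 1.987e-3  # gas constant in kcal/(mol*K)


def k_melt(A, Ea, T):
    """Arrhenius rate constant k_m = A * exp(-Ea / (R * T))."""
    return A * math.exp(-Ea / (R * T))


def optimal_wash(Hp0, Hi0, B, kmp, kmi):
    """Return (t_w, D): t_w is when the imperfect hybrid reaches background B."""
    t_w = -math.log(B / Hi0) / kmi
    D = (Hp0 / B) * (B / Hi0) ** (kmp / kmi)
    return t_w, D


# Hypothetical parameters: equal pre-exponential factors, and an imperfect
# hybrid whose activation energy is 2 kcal/mol lower than the perfect one.
A = 1.0e30                 # 1/s, assumed
EA_P, EA_I = 45.0, 43.0    # kcal/mol, assumed
HP0, HI0, B = 1000.0, 1000.0, 10.0

results = {}
for celsius in (7, 13, 25):
    T = celsius + 273.15
    kmp, kmi = k_melt(A, EA_P, T), k_melt(A, EA_I, T)
    results[celsius] = optimal_wash(HP0, HI0, B, kmp, kmi)
```

With these assumed numbers the maximal D falls slightly as the washing temperature rises from 7°C to 25°C, while the optimal washing time shortens from hours to minutes, which is the trade-off stated in the text.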

D at lower temperatures depends to a higher degree on the $H_p(t_0)/B$ ratio than on
the $H_p(t_0)/H_i(t_0)$ ratio. This result indicates that it is better to obtain a sufficient
quantity of $H_p$ in the hybridization regardless of the discrimination that can be achieved in
this step. Better discrimination can then be obtained by washing, since the higher
amounts of perfect hybrid allow more time for differential melting to show an effect.
Similarly, using larger amounts of target nucleic acid, a necessary discrimination can be
obtained even with small differences between $k_{m,p}$ and $k_{m,i}$.

Extrapolated to a more complex situation than covered in this simple model, the 
result is that washing at lower temperatures is even more important for obtaining
discrimination in the case of hybridization of a probe having many end-mismatches within
a given nucleic acid target.

Using the described theoretical principles as a guide for experiments, reliable
hybridizations have been obtained with probes six to eight nucleotides in length. All
experiments were performed with a floating plastic sheet providing a film of hybridization
solution above the filter. This procedure allows maximal reduction in the amount of
probe, and thus reduced label costs in dot blot hybridizations. The high concentration of
sodium lauroyl sarcosine instead of sodium lauroyl sulfate in the phosphate hybridization
buffer allows dropping the reaction from room temperature down to 12°C. Similarly, the
4-6 X SSC, 10% sodium lauroyl sarcosine buffer allows hybridization at temperatures as
low as 2°C. The detergent in these buffers is for obtaining tolerable background with up
to 40 nM concentrations of labelled probe. Preliminary characterization of the thermal
stability of short oligonucleotide hybrids was performed on a prototype octamer with
50% G+C content, i.e., a probe of sequence TGCTCATG. The theoretical expectation is
that this probe is among the less stable octamers. Its transition enthalpy is similar to
those of more stable heptamers or even to probes 6 nucleotides in length [Breslauer et
al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)]. Parameter $T_d$, the temperature at
which 50% of the hybrid is melted in unit time of a minute, is 18°C. The result shows
that $T_d$ is 15°C lower for the 8 bp hybrid than for an 11 bp duplex [Wallace et al.,
Nucleic Acids Res. 6: 3543 (1979)].

In addition to experiments with model oligonucleotides, an M13 vector was chosen
as a system for a practical demonstration of short oligonucleotide hybridization. The
main aim was to show useful end-mismatch discrimination with a target similar to the
ones which will be used in various applications of the method of the invention.
Oligonucleotide probes for the M13 model were chosen in such a way that the M13
vector itself contains the end mismatched base. Vector IF, an M13 recombinant
containing a 921 bp human interferon gene insert, carries a single perfectly matched target.
Thus, IF has either the identical or a higher number of mismatched targets in comparison
to the M13 vector itself.

Using low temperature conditions and dot blots, sufficient differences in
hybridization signals were obtained between the dot containing the perfect and the
mismatched targets and the dot containing the mismatched targets only. This was true for
the 6-mer oligonucleotides and was also true for the 7 and 8-mer oligonucleotides
hybridized to the large IF-M13 pair of nucleic acids.

The hybridization signal depends on the amount of target available on the filter for
reaction with the probe. A necessary control is to show that the difference in signal
intensity is not a reflection of varying amounts of nucleic acid in the two dots.
Hybridization with a probe that has the same number and kind of targets in both IF and
M13 shows that there is an equal amount of DNA in the dots. Since the efficiency of
hybrid formation increases with hybrid length, the signal for a duplex having six
nucleotides was best detected with a high mass of oligonucleotide target bound to the
filter. Due to their lower molecular weight, a larger number of oligonucleotide target
molecules can be bound to a given surface area when compared to large molecules of
nucleic acid that serve as target.

To measure the sensitivity of detection with unpurified DNA, various amounts of
phage supernatants were spotted on the filter and hybridized with a $^{32}$P-labelled octamer.
As little as 50 million unpurified phage containing no more than 0.5 ng of DNA gave a
detectable signal, indicating that sensitivity of the short oligonucleotide hybridization
method is sufficient. Reaction time is short, adding to the practicality.

As mentioned in the theoretical section above, the equilibrium yield of hybrid
depends on probe concentration and/or temperature of reaction. For instance, the signal
level for the same amount of target with 4 nM octamer at 13°C is 3 times lower than with
a probe concentration of 40 nM, and is decreased 4.5-times by raising the hybridization
temperature to 25°C.

The utility of the low temperature wash for achieving maximal discrimination is
demonstrated. To make the phenomenon visually obvious, 50 times more DNA was put in
the M13 dot than in the IF dot using hybridization with a vector specific probe. In this
way, the signal after the hybridization step with the actual probe was made stronger in
the mismatched than in the matched case. The $H_p/H_i$ ratio was 1:4. Inversion of signal
intensities after prolonged washing at 7°C was achieved without a massive loss of perfect
hybrid, resulting in a ratio of 2:1. In contrast, it is impossible to achieve any
discrimination at 25°C, since the matched target signal is already brought down to the
background level with 2 minute washing; at the same time, the signal from the
mismatched hybrid is still detectable. The loss of discrimination at 13°C compared to 7°C
is not so great but is clearly visible. If one considers the 90 minute point at 7°C and the
15 minute point at 13°C, when the mismatched hybrid signal is near the background
level, which represent optimal washing times for the respective conditions, it is obvious
that the amount of perfect hybrid is several times greater at 7°C than at 13°C. To illustrate
this further, the time course of the change in discrimination with washing of the same
amount of starting hybrid at the two temperatures shows the higher maximal D at the lower
temperature. These results confirm the trend in the change of D with temperature and the
ratio of amounts of the two types of hybrid at the start of the washing step.

In order to show the general utility of the short oligonucleotide hybridization
conditions, we have looked at hybridization of 4 heptamers, 10 octamers and an additional
14 probes up to 12 nucleotides in length in our simple M13 system. These include the
nonamer GTTTTTTAA and octamer GGCAGGCG representing the two extremes of GC
content. Although GC content and sequence are expected to influence the stability of
short hybrids [Breslauer et al., Proc. Natl. Acad. Sci. U.S.A. 83: 3746 (1986)], the
low temperature short oligonucleotide conditions were applicable to all tested probes in
achieving sufficient discrimination. Since the best discrimination value obtained with
probes 13 nucleotides in length was 20, a several fold drop due to sequence variation is
easily tolerated.

The M13 system has the advantage of showing the effects of target DNA
complexity on the levels of discrimination. For two octamers having either none or five
mismatched targets and differing in only one GC pair, the observed discriminations were
18.3 and 1.7, respectively.

In order to show the utility of this method, three probes 8 nucleotides in length
were tested on a collection of 51 plasmid DNA dots made from a library in Bluescript
vector. One probe was present and specific for Bluescript vector but was absent in M13,
while the other two probes had targets that were inserts of known sequence. This system
allowed the use of hybridization negative or positive control DNAs with each probe. This
probe sequence (CTCCCTTT) also had a complementary target in the interferon insert.
Since the M13 dot is negative while the interferon insert in either M13 or Bluescript was
positive, the hybridization is sequence specific. Similarly, there were probes that detected
the target sequence in only one of 51 inserts, or in none of the examined inserts, along
with controls that confirm that hybridization would have occurred if the appropriate
targets were present in the clones.

Thermal stability curves for very short oligonucleotide hybrids that are 6-8
nucleotides in length are at least 15°C lower than for hybrids 11-12 nucleotides in length
[Fig. 1 and Wallace et al., Nucleic Acids Res. 6: 3543-3557 (1979)]. However,
performing the hybridization reaction at a low temperature and with a very practical
0.4-40 nM concentration of oligonucleotide probe allows the detection of complementary
sequence in a known or unknown nucleic acid target. To determine an unknown nucleic
acid sequence completely, an entire set containing 65,536 8-mer probes may be used.
Sufficient amounts of nucleic acid for this purpose are present in convenient biological
samples such as a few microliters of M13 culture, a plasmid prep from 10 ml of bacterial
culture or a single colony of bacteria, or less than 1 µl of a standard PCR reaction.

Short oligonucleotides 6-10 nucleotides long give excellent discrimination. The
relative decrease in hybrid stability with a single end mismatch is greater than for longer
probes. Results with the octamer TGCTCATG support this conclusion. In the
experiments, a target with a G/T end mismatch was used; hybridization to a target with
this type of mismatch is the most stable of all mismatch types. The discrimination
achieved is the same as or greater than that for an internal G/T mismatch in a 19 base
pair duplex [Ikuta et al., Nucl. Acids Res. 15: 797 (1987)]. Exploiting these
discrimination properties using the described
hybridization conditions for short oligonucleotide hybridization allows a very precise
determination of oligonucleotide targets. In contrast to the ease of detecting discrimination
between perfect and imperfect hybrids, a problem that may exist with using very short
oligonucleotides is the preparation of sufficient amounts of hybrids. In practice, the need
to discriminate $H_p$ and $H_i$ is aided by increasing the amount of DNA in the dot and/or the
probe concentration, or by decreasing the hybridization temperature. However, higher
probe concentrations usually increase background. Moreover, there are limits to the
amounts of target nucleic acid that are practical to use. This problem was solved by the
higher concentration of the detergent Sarcosyl which gave an effective background with 4
nM of probe. Further improvements may be effected either in the use of competitors for
unspecific binding of probe to filter, or by changing the hybridization support material.
Moreover, for probes having $E_a$ less than 45 Kcal/mol (e.g., for many heptamers and a
majority of hexamers), modified oligonucleotides give a more stable hybrid [Asseline et
al., Proc. Natl. Acad. Sci. 81: 3297 (1984)] than their unmodified counterparts. The
hybridization conditions described in this invention for short oligonucleotide hybridization
using low temperatures give better discrimination for all sequences and duplex hybrid
inputs. The only price paid in achieving uniformity in hybridization conditions for
different sequences is an increase in washing time from minutes to up to 24 hours
depending on the sequence. Moreover, the washing time can be further reduced by
decreasing the salt concentration.

Although there is excellent discrimination of one matched hybrid over
mismatched hybrids in short oligonucleotide hybridization, signals from mismatched
hybrids exist, with the majority of the mismatched hybrids resulting from end mismatch.
This may limit insert sizes that may be effectively examined by a probe of a certain
length.

The influence of sequence complexity on discrimination cannot be ignored.
However, the complexity effects are more significant when defining sequence information
by short oligonucleotide hybridization for specific, nonrandom sequences, and can be
overcome by using an appropriate probe to target length ratio. The length ratio is chosen
to make unlikely, on statistical grounds, the occurrence of specific sequences which have
a number of end-mismatches which would be able to eliminate or falsely invert
discrimination. Results suggest the use of oligonucleotides 6, 7, and 8 nucleotides in
length on target nucleic acid inserts shorter than 0.6, 2.5, and 10 kb, respectively.
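
One way to read these size limits is to estimate how often a given probe sequence would occur by chance. The sketch below assumes a random target with equal base frequencies and counts one strand only; this model is an illustration and is not stated in the source.

```python
def expected_hits(probe_len, target_len):
    """Expected chance occurrences of one specific k-mer on one strand of a random target."""
    positions = target_len - probe_len + 1
    return positions / 4 ** probe_len


# Probe length -> suggested maximum insert size in bp, taken from the text.
suggested = {6: 600, 7: 2500, 8: 10000}
ratios = {k: expected_hits(k, n) for k, n in suggested.items()}
```

Under this assumption all three recommended pairs give about 0.15 expected chance occurrences per probe, so the suggested insert sizes scale by roughly a factor of four for each added probe base.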

EXAMPLE 11 
DNA Sequencing 

An array of subarrays allows for efficient sequencing of a small set of samples
arrayed in the form of replicated subarrays. For example, 64 samples may be arrayed on
an 8 X 8 mm subarray and 16 X 24 subarrays may be replicated on a 15 X 23 cm
membrane with 1 mm wide spacers between the subarrays. Several replica membranes
may be made. For example, probes from a universal set of three thousand seventy-two
7-mers may be divided in thirty-two 96-well plates and labelled by kinasing. Four
membranes may be processed in parallel during one hybridization cycle. On each
membrane, 384 probes may be scored. All probes may be scored in two hybridization
cycles. Hybridization intensities may be scored and the sequence assembled as described
below.
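
The bookkeeping in this example can be verified with a few lines of arithmetic. The sketch assumes one probe is applied per subarray per hybridization cycle, which is how the figure of 384 probes per membrane is read here.

```python
PROBES = 3072                     # universal 7-mer set, from the text
PLATE_WELLS = 96
SUBARRAYS_PER_MEMBRANE = 16 * 24  # from the text
MEMBRANES_PER_CYCLE = 4
SAMPLES_PER_SUBARRAY = 64
SUBARRAY_PITCH_MM = 8 + 1         # 8 mm subarray plus 1 mm spacer

plates = PROBES // PLATE_WELLS
probes_per_cycle = SUBARRAYS_PER_MEMBRANE * MEMBRANES_PER_CYCLE
cycles = -(-PROBES // probes_per_cycle)  # ceiling division
dots_per_membrane = SUBARRAYS_PER_MEMBRANE * SAMPLES_PER_SUBARRAY
footprint_mm = (16 * SUBARRAY_PITCH_MM, 24 * SUBARRAY_PITCH_MM)
```

The numbers agree with the text: 32 plates, 1536 probes per cycle across four membranes, two cycles for the full set, and a 144 x 216 mm footprint that fits the 15 x 23 cm membrane.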

If a single sample subarray or subarrays contain several unknowns, especially
when similar samples are used, a smaller number of probes may be sufficient if they are
intelligently selected on the basis of results of previously scored probes. For example, if
probe AAAAAAA is not positive, there is a small chance that any of 8 overlapping
probes are positive. If AAAAAAA is positive, then two probes are usually positive. The
sequencing process in this case consists of first hybridizing a subset of minimally
overlapped probes to define positive anchors and then successively selecting probes which
confirm one of the most likely hypotheses about the order of anchors and size and type
of gaps between them. In this second phase, pools of 2-10 probes may be used where
each probe is selected to be positive in only one DNA sample which is different from the
samples expected to be positive with other probes from the pool.
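
The "8 overlapping probes" can be read as the probes shifted by one base in either direction. A minimal sketch of that reading follows; the function name is illustrative, and the example probe is a 7-mer prefix of the octamer used earlier rather than a probe from this example.

```python
BASES = "ACGT"


def overlapping_neighbors(probe):
    """Probes that overlap `probe` in all but one position (shifted by one base)."""
    right = {probe[1:] + b for b in BASES}   # shifted one base downstream
    left = {b + probe[:-1] for b in BASES}   # shifted one base upstream
    return left | right


neighbors = overlapping_neighbors("TGCTCAT")
```

For a typical probe this gives eight distinct neighbors, four on each side. For a homopolymer such as AAAAAAA some of the shifted probes coincide with each other and with the probe itself, so fewer distinct sequences result.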

The subarray approach allows efficient implementation of probe competition
(overlapped probes) or probe cooperation (continuous stacking of probes) in solving
branching problems. After hybridization of a universal set of probes the sequence
assembly program determines candidate sequence subfragments (SFs). For the further
assembly of SFs, additional information has to be provided (from overlapped sequences of
DNA fragments, similar sequences, single pass gel sequences, or from other hybridization
or restriction mapping data). Primers for single pass gel sequencing through the branch
points are identified from the SBH sequence information or from known vector
sequences, e.g., the flanking sequences to the vector insert site, and standard Sanger
sequencing reactions are performed on the sample DNA. The sequence obtained from
this single pass gel sequencing is compared to the SFs that read into and out of the branch
points to identify the order of the SFs. Further, single pass gel sequencing may be
combined with SBH to de novo sequence or re-sequence a nucleic acid.

Competitive hybridization and continuous stacking interactions can also be used to
assemble SFs. These approaches are of limited commercial value for sequencing of large
numbers of samples by SBH wherein a labelled probe is applied to a sample affixed to an
array if a uniform array is used. Fortunately, analysis of small numbers of samples using
replica subarrays allows efficient implementation of both approaches. On each of the
replica subarrays, one branching point may be tested for one or more DNA samples using
pools of probes similarly as in solving mutated sequences in different samples spotted in
the same subarray (see above).

If in each of 64 samples described in this example, there are about 100 branching
points, and if 8 samples are analyzed in parallel in each subarray, then at least 800
subarray probings solve all branches. This means that for the 3072 basic probings an
additional 800 probings (25%) are employed. More preferably, two probings are used for
one branching point. If the subarrays are smaller, fewer additional probings are used. For
example, if subarrays consist of 16 samples, 200 additional probings may be scored (6%).
By using 7-mer probes ($N_{1-2}B_7N_{1-2}$) and competitive or collaborative branching solving
approaches or both, fragments of about 1000 bp may be assembled by about
4000 probings. Furthermore, using 8-mer probes ($NB_8N$) 4 kb or longer fragments may
be assembled with 12,000 probings. Gapped probes, for example, $NB_4NB_3N$ or
$NB_4NB_4N$, may be used to reduce the number of branching points.
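
The probing counts quoted in this paragraph follow from simple arithmetic, sketched here under the reading that every branching point of every sample in a subarray needs one probing and that 8 samples share each probing. The function name is illustrative.

```python
BASIC_PROBINGS = 3072          # from the text
BRANCH_POINTS_PER_SAMPLE = 100
SAMPLES_IN_PARALLEL = 8


def extra_probings(samples_per_subarray):
    """Additional probings needed to resolve all branch points in one subarray."""
    total_branches = samples_per_subarray * BRANCH_POINTS_PER_SAMPLE
    return total_branches // SAMPLES_IN_PARALLEL


overhead = {
    n: (extra_probings(n), 100.0 * extra_probings(n) / BASIC_PROBINGS)
    for n in (64, 16)
}
```

This gives 800 extra probings (about 26%) for 64-sample subarrays and 200 (about 6.5%) for 16-sample subarrays, matching the rounded figures of 25% and 6% in the text.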

EXAMPLE 12

DNA Analysis by Transient Attachment to Subarrays
of Probes and Ligation of Labelled Probes

Oligonucleotide probes having an informative length of four to 40 bases are
synthesized by standard chemistry and stored in tubes or in multiwell plates. Specific sets
of probes comprising one to 10,000 probes are arrayed by deposition or in situ synthesis
on separate supports or distinct sections of a larger support. In the last case, sections or
subarrays may be separated by physical or hydrophobic barriers. The probe arrays may
be prepared by in situ synthesis. A sample DNA of appropriate size is hybridized with
one or more specific arrays. Many samples may be interrogated as pools at the same
subarrays or independently with different subarrays within one support. Simultaneously
with the sample or subsequently, a single labelled probe or a pool of labelled probes is
added on each of the subarrays. If attached and labelled probes hybridize back to back on
the complementary target in the sample DNA they are ligated. Occurrence of ligation will
be measured by detecting a label from the probe.

This procedure is a variant of the described DNA analysis process in which DNA
samples are not permanently attached to the support. Transient attachment is provided by
probes fixed to the support. In this case there is no need for a target DNA arraying
process. In addition, ligation allows detection of longer oligonucleotide sequences by
combining short labelled probes with short fixed probes.

The process has several unique features. Basically, the transient attachment of the
target allows its reuse. After ligation occurs the target may be released and the label will
stay covalently attached to the support. This feature allows cycling the target and
production of detectable signal with a small quantity of the target. Under optimal
conditions, targets do not need to be amplified, e.g., natural sources of the DNA samples
may be directly used for diagnostics and sequencing purposes. Targets may be released by
cycling the temperature between efficient hybridization and efficient melting of duplexes.
More preferably, there is no cycling. The temperature and concentrations of components
may be defined to have an equilibrium between free targets and targets entered in hybrids
at about 50:50% level. In this case there is a continuous production of ligated products.
For different purposes different equilibrium ratios are optimal.

An electric field may be used to enhance target use. At the beginning, a horizontal
field pulsing within each subarray may be employed to provide for faster target sorting.
In this phase, the equilibrium is moved toward hybrid formation, and unlabelled probes
may be used. After a target sorting phase, an appropriate washing (which may be helped
by a vertical electric field for restricting movement of the samples) may be performed.
Several cycles of discriminative hybrid melting, target harvesting by hybridization and
ligation and removing of unused targets may be introduced to increase specificity. In the
next step, labelled probes are added and vertical electrical pulses may be applied. By
increasing temperature, an optimal free and hybridized target ratio may be achieved. The
vertical electric field prevents diffusion of the sorted targets.

The subarrays of fixed probes and sets of labelled probes (specially designed or
selected from a universal probe set) may be arranged in various ways to allow an efficient
and flexible sequencing and diagnostics process. For example, if a short fragment (about
100-500 bp) of a bacterial genome is to be partially or completely sequenced, small arrays
of probes (5-30 bases in length) designed on the basis of known sequence may be used.
If interrogated with a different pool of 10 labelled probes per subarray, an array of 10
subarrays each having 10 probes, allows checking of 200 bases, assuming that only two
bases connected by ligation are scored. Under the conditions where mismatches are
discriminated throughout the hybrid, probes may be displaced by more than one base to
cover the longer target with the same number of probes. By using long probes, the target
may be interrogated directly without amplification or isolation from the rest of DNA in
the sample. Also, several targets may be analyzed (screened for) in one sample
simultaneously. If the obtained results indicate occurrence of a mutation (or a pathogen),
additional pools of probes may be used to detect type of the mutation or subtype of
pathogen. This is a desirable feature of the process which may be very cost effective in
preventive diagnosis where only a small fraction of patients is expected to have an
infection or mutation.

In the processes described in the examples, various detection methods may be
used, for example, radiolabels, fluorescent labels, enzymes or antibodies
(chemiluminescence), large molecules or particles detectable by light scattering or
interferometric procedures.

EXAMPLE 13 
Sequencing a Target Using Octamers and Nonamers

Data resulting from the hybridization of octamer and nonamer oligonucleotides
show that sequencing by hybridization provides an extremely high degree of accuracy.
In this experiment, a known sequence was used to predict a series of contiguous
overlapping component octamer and nonamer oligonucleotides.

In addition to the perfectly matching oligonucleotides, mismatch oligonucleotides
wherein internal or end mismatches occur in the duplex formed
by the oligonucleotide and the target were examined. In these analyses, the lowest
practical temperature was used to maximize hybrid formation. Washes were
accomplished at the same or lower temperatures to ensure maximal discrimination by
utilizing the greater dissociation rate of mismatched versus matched oligonucleotide/target
hybrids. These conditions are shown to be applicable to all sequences, although the
absolute hybridization yield is shown to be sequence dependent.

The least destabilizing mismatch that can be postulated is a simple end mismatch,
so that the test of sequencing by hybridization is the ability to discriminate perfectly
matched oligonucleotide/target duplexes from end-mismatched oligonucleotide/target
duplexes.

The discriminative values for 102 of 105 hybridizing oligonucleotides in a dot blot
format were greater than 2, allowing a highly accurate generation of the sequence. This
system also allowed an analysis of the effect of sequence on hybrid formation and
hybrid instability.

One hundred base pairs of a known portion of a human-interferon genes prepared
by PCR, i.e., a 100 bp target sequence, was generated with data resulting from the
hybridization of 105 oligonucleotide probes of known sequence to the target nucleic acid.
The oligonucleotide probes used included 72 octamer and 21 nonamer oligonucleotides
whose sequence was perfectly complementary to the target. The set of 93 probes
provided consecutive overlapping frames of the target sequence displaced by one or two
bases.

To evaluate the effect of mismatches, hybridization was examined for 12 additional
probes that contained at least one end mismatch when hybridized to the 100 bp test target
sequence. Also tested was the hybridization of twelve probes with target end-mismatched
to four other control nucleic acid sequences chosen so that the 12 oligonucleotides formed
perfectly matched duplex hybrids with the four control DNAs. Thus, the hybridization of
internal mismatched, end-mismatched and perfectly matched duplex pairs of
oligonucleotide and target was evaluated for each oligonucleotide used in the experiment.

The effect of absolute DNA target concentration on the hybridization with the test
octamer and nonamer oligonucleotides was determined by defining target DNA
concentration by detecting hybridization of a different oligonucleotide probe to a single
occurrence non-target site within the co-amplified plasmid DNA.

The results of this experiment showed that all oligonucleotides containing perfect
matching complementary sequence to the target or control DNA hybridized more strongly
than those oligonucleotides having mismatches. To come to this conclusion, we examined
Hp and D values for each probe. Hp defines the amount of hybrid duplex formed between
a test target and an oligonucleotide probe. By assigning values of between 0 and 10 to
the hybridization obtained for the 105 probes, it was apparent that 68.5% of the 105
probes had an Hp greater than 2.

Discrimination (D) values were obtained where D was defined as the ratio of
signal intensities between 1) the dot containing a perfect matched duplex formed between
test oligonucleotide and target or control nucleic acid and 2) the dot containing a
mismatch duplex formed between the same oligonucleotide and a different site within the
target or control nucleic acid. Variations in the value of D result from either 1)
perturbations in the hybridization efficiency which allows visualization of signal over
background, or 2) the type of mismatch found between the test oligonucleotide and the
target. The D values obtained in this experiment were between 2 and 40 for 102 of the
105 oligonucleotide probes examined. Calculations of D for the group of 102
oligonucleotides as a whole showed the average D was 10.6.
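
The definition of D as a ratio of two dot intensities may be illustrated with a short
Python sketch. This is an illustration only, not the procedure used in the experiment:
the function name, the probe names and the intensity values are hypothetical, and the
cutoff of 2 is taken from the text above.

```python
from statistics import mean

def discrimination(perfect_signal, mismatch_signal):
    """D = intensity of the perfect-match dot / intensity of the mismatch dot."""
    if mismatch_signal <= 0:
        raise ValueError("mismatch signal must be positive")
    return perfect_signal / mismatch_signal

# Hypothetical (perfect-match, mismatch) dot intensities in arbitrary units.
dots = {
    "probe_a": (840.0, 70.0),
    "probe_b": (300.0, 120.0),
    "probe_c": (95.0, 60.0),
}

d_values = {name: discrimination(p, m) for name, (p, m) in dots.items()}

# Keep only probes whose D exceeds 2, the cutoff quoted in the text.
reliable = {name: d for name, d in d_values.items() if d > 2}
average_d = mean(reliable.values())
```

With these made-up intensities, probe_c falls below the cutoff and is excluded from
the average, in the same way that the average D above is reported for 102 of the 105
probes.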

There were 20 cases where oligonucleotide/target duplexes exhibited an
end-mismatch. In five of these, D was greater than 10. The large D value in these cases
is most likely due to hybridization destabilization caused by other than the most stable
(G/T and G/A) end mismatches. The other possibility is that there was an error in the
sequence of either the oligonucleotides or the target.

Error in the target for probes with low Hp was excluded as a possibility because
such an error would have affected the hybridization of each of the other eight overlapping
oligonucleotides. There was no apparent instability due to sequence mismatch for the
other overlapping oligonucleotides, indicating the target sequence was correct. Error in
the oligonucleotide sequence was excluded as a possibility after the hybridization of seven
newly synthesized oligonucleotides was re-examined. Only 1 of the seven
oligonucleotides resulted in a better D value. Low hybrid formation values may result
from hybrid instability or from an inability to form hybrid duplex. An inability to form
hybrid duplexes would result from either 1) self-complementarity of the chosen probe or
2) target/target self-hybridization. Oligonucleotide/oligonucleotide duplex formation may
be favored over oligonucleotide/target hybrid duplex formation if the probe was
self-complementary. Similarly, target/target association may be favored if the target was
self-complementary or may form internal palindromes. In evaluating these possibilities, it
was apparent from probe analysis that the questionable probes did not form hybrids with
themselves. Moreover, in examining the contribution of target/target hybridization, it
was determined that one of the questionable oligonucleotide probes hybridized
inefficiently with two different DNAs containing the same target. The low probability
that two different DNAs have a self-complementary region for the same target sequence
leads to the conclusion that target/target hybridization did not contribute to low
hybrid formation. Thus, these results indicate that hybrid instability and not the
inability to form hybrids was the cause of the low hybrid formation observed for specific
oligonucleotides. The results also indicate that low hybrid formation is due to the specific
sequences of certain oligonucleotides. Moreover, the results indicate that reliable results
may be obtained to generate sequences if octamer and nonamer oligonucleotides are used.

These results show that, using the methods described, long sequences of any specific
target nucleic acid may be generated by maximal and unique overlap of constituent
oligonucleotides. Such sequencing methods are dependent on the content of the individual
component oligomers regardless of their frequency and their position.

The sequence which is generated using the algorithm described below is of high
fidelity. The algorithm tolerates false positive signals from the hybridization dots, as is
indicated by the fact that the sequence generated from the 105 hybridization values, which
included four less reliable values, was correct. This fidelity in sequencing by
hybridization is due to the "all or none" kinetics of short oligonucleotide hybridization
and the difference in duplex stability that exists between perfectly matched duplexes and
mismatched duplexes. The ratio of duplex stability of matched and end-mismatched
duplexes increases with decreasing duplex length. Moreover, binding energy decreases
with decreasing duplex length, resulting in a lower hybridization efficiency. However, the
results provided show that octamer hybridization allows the balancing of the factors
affecting duplex stability and discrimination to produce a highly accurate method of
sequencing by hybridization. Results presented in other examples show that
oligonucleotides that are 6, 7, or 8 nucleotides can be effectively used to generate reliable
sequence on targets that are 0.5 kb (for hexamers), 2 kb (for septamers) and 6 kb (for
octamers). The sequence of long fragments may be overlapped to generate a complete
genome sequence.

EXAMPLE 14 

Analyzing the Data Obtained 

Image files are analyzed by an image analysis program, such as the DOTS program
(Drmanac et al., 1993), and scaled and evaluated by statistical functions included, e.g., in
the SCORES program (Drmanac et al., 1994). From the distribution of the signals an optimal
threshold is determined for transforming signal into +/- output. From the position of the
label detected, F + P nucleotide sequences from the fragments would be determined by
combining the known sequences of the immobilized and labeled probes corresponding to
the labeled positions. The complete nucleic acid sequence or sequence subfragments of
the original molecule, such as a human chromosome, would then be assembled from the
overlapping F + P sequences determined by computational deduction.

One option is to transform hybridization signals, e.g., scores, into +/- output
during the sequence assembly process. In this case, assembly will start with an F + P
sequence with a very high score, for example F + P sequence AAAAAATTTTTT.
Scores of all four possible overlapping probes (AAAAATTTTTTA, AAAAATTTTTTT,
AAAAATTTTTTC and AAAAATTTTTTG) and three additional probes that are different
at the beginning (TAAAAATTTTTT, CAAAAATTTTTT and GAAAAATTTTTT) are
compared and three outcomes defined: (i) only the starting probe and only one of the four
overlapping probes have scores that are significantly positive relative to the other six
probes, in which case the AAAAAATTTTTT sequence will be extended by one nucleotide
to the right; (ii) no probe except the starting probe has a significantly positive score, in
which case assembly will stop, e.g., the AAAAAATTTTTT sequence is at the end of the DNA
molecule that is sequenced; (iii) more than one significantly positive probe among the
overlapped and/or other three probes is found, in which case assembly is stopped because
of an error or branching (Drmanac et al., 1989).
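
The three outcomes above may be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the program referred to in this example:
the function name, the score dictionary and the fixed threshold used to call a probe
"significantly positive" are all hypothetical. The case of a single positive probe
among the three alternates only, which the text does not address explicitly, is
treated here as a stop.

```python
BASES = "ACGT"

def extend_right(current, scores, threshold):
    """Decide one assembly step for the F + P word `current`.

    scores    : dict mapping probe sequence -> hybridization score (hypothetical)
    threshold : score at or above which a probe is called positive (assumed)
    Returns (outcome, sequence).
    """
    tail = current[1:]   # overlap shared with every neighbouring probe
    first = current[0]
    # The four overlapping probes, shifted one base to the right.
    right = [tail + b for b in BASES]
    # The three probes that share the tail but differ at the beginning.
    alternates = [b + tail for b in BASES if b != first]

    pos_right = [p for p in right if scores.get(p, 0.0) >= threshold]
    pos_alt = [p for p in alternates if scores.get(p, 0.0) >= threshold]

    if len(pos_right) == 1 and not pos_alt:
        return "extend", current + pos_right[0][-1]   # outcome (i)
    if not pos_right and not pos_alt:
        return "end", current                         # outcome (ii)
    return "branch_or_error", current                 # outcome (iii) and any other pattern

start = "AAAAAATTTTTT"
step = extend_right(start, {start: 9.0, "AAAAATTTTTTC": 8.5}, threshold=5.0)
```

In this usage example only AAAAATTTTTTC scores above the assumed threshold, so the
sequence is extended by a C.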

The processes of computational deduction would employ computer programs
using existing algorithms (see, e.g., Pevzner, 1989; Drmanac et al., 1991; Labat and
Drmanac, 1993; each incorporated herein by reference).

If, in addition to F + P, F(space 1)P, F(space 2)P, F(space 3)P or F(space 4)P
are determined, algorithms will be used to match all data sets to correct potential errors
or to solve the situation where there is a branching problem (see, e.g., Drmanac et al.,
1989; Bains et al., 1988; each incorporated herein by reference).

EXAMPLE 15 

Conducting Sequencing by Two Step Hybridization 

The following are certain examples describing the execution of the sequencing
methodology contemplated by the inventor. First, the whole chip would be hybridized
with a mixture of DNA as complex as 100 million bp (one human chromosome).
Guidelines for conducting hybridization can be found in papers such as Drmanac et al.
(1990); Khrapko et al. (1991); and Broude et al. (1994). These articles teach the ranges
of hybridization temperatures, buffers and washing steps that are appropriate for use in
the initial steps of Format 3 SBH.

The present inventor particularly contemplates that hybridization is to be carried
out for up to several hours in high salt concentrations at a low temperature (-2°C to 5°C)
because of a relatively low concentration of target DNA that can be provided. For this
purpose, SSC buffer is used instead of sodium phosphate buffer (Drmanac et al., 1990),
which precipitates at 10°C. Washing does not have to be extensive (a few minutes)
because of the second step, and can be completely eliminated when the hybridization
cycling is used for the sequencing of highly complex DNA samples. The same buffer is
used for hybridization and washing steps to be able to continue with the second
hybridization step with labeled probes.

After proper washing using a simple robotic device on each array, e.g., an 8 x 8
mm array, one labeled probe, e.g., a 6-mer, would be added. A 96-tip or 96-pin device
would be used, performing this in 42 operations. Again, a range of discriminatory
conditions could be employed, as previously described in the scientific literature.

The present inventor particularly contemplates the use of the following conditions.
First, after adding labeled probes and incubating for several minutes only (because of the
high concentration of added oligonucleotides) at a low temperature (0-5°C), the
temperature is increased to 3-10°C, depending on F + P length, and the washing buffer is
added. At this time, the washing buffer used is one compatible with any ligation reaction
(e.g., 100 mM salt concentration range). After adding ligase, the temperature is increased
again to 15-37°C to allow fast ligation (less than 30 min) and further discrimination of full
match and mismatch hybrids.

The use of cationic detergents is also contemplated for use in Format 3 SBH, as
described by Pontius & Berg (1991, incorporated herein by reference). These authors
describe the use of two simple cationic detergents, dodecyl- and cetyltrimethylammonium
bromide (DTAB and CTAB), in DNA renaturation.

DTAB and CTAB are variants of the quaternary amine tetramethylammonium
bromide (TMAB) in which one of the methyl groups is replaced by either a 12-carbon
(DTAB) or a 16-carbon (CTAB) alkyl group. TMAB is the bromide salt of the
tetramethylammonium ion, a reagent used in nucleic acid renaturation experiments to
decrease the G-C content bias of the melting temperature. DTAB and CTAB are similar
in structure to sodium dodecyl sulfate (SDS), with the replacement of the negatively
charged sulfate of SDS by a positively charged quaternary amine. While SDS is
commonly used in hybridization buffers to reduce nonspecific binding and inhibit
nucleases, it does not greatly affect the rate of renaturation.

When using a ligation process, the enzyme could be added with the labeled probes
or after the proper washing step to reduce the background. Although not previously
proposed for use in any SBH method, ligase technology is well established within the
field of molecular biology. For example, Hood and colleagues described a
ligase-mediated gene detection technique (Landegren et al., 1988), the methodology of
which can be readily adapted for use in Format 3 SBH. Wu & Wallace also describe the
use of bacteriophage T4 DNA ligase to join two adjacent, short synthetic oligonucleotides.
Their oligo ligation reactions were carried out in 50 mM Tris-HCl pH 7.6, 10 mM
MgCl2, 1 mM ATP, 1 mM DTT, and 5% PEG. Ligation reactions were heated to 100°C
for 5-10 min followed by cooling to 0°C prior to the addition of T4 DNA ligase (1 unit;
Bethesda Research Laboratory). Most ligation reactions were carried out at 30°C and
terminated by heating to 100°C for 5 min.

Final washing appropriate for discriminating detection of hybridized adjacent, or
ligated, oligonucleotides of length (F + P) is then performed. This washing step is done
in water for several minutes at 40-60°C to wash out all the non-ligated labeled probes,
and all other compounds, to maximally reduce background. Because the labeled
oligonucleotides are covalently bound, detection is simplified (it does not have time and
low temperature constraints).

Depending on the label used, imaging of the chips is done with different apparatus.
For radioactive labels, phosphor storage screen technology and a PhosphorImager as a
scanner may be used (Molecular Dynamics, Sunnyvale, CA). Chips are put in a cassette
and covered by a phosphor screen. After 1-4 hours of exposure, the screen is scanned
and the image file stored on a computer hard disc. For the detection of fluorescent labels,
CCD cameras and epifluorescent or confocal microscopy are used. For the chips
generated directly on the pixels of a CCD camera, detection can be performed as
described by Eggers et al. (1994, incorporated herein by reference).

Charge-coupled device (CCD) detectors serve as active solid supports that
quantitatively detect and image the distribution of labeled target molecules in probe-based
assays. These devices use the inherent characteristics of microelectronics that
accommodate highly parallel assays, ultrasensitive detection, high throughput, integrated
data acquisition and computation. Eggers et al. (1994) describe CCDs for use with
probe-based assays, such as Format 3 SBH of the present invention, that allow
quantitative assessment within seconds due to the high sensitivity and direct coupling
employed.

The integrated CCD detection approach enables the detection of molecular binding
events on chips. The detector rapidly generates a two-dimensional pattern that uniquely
characterizes the sample. In the specific operation of the CCD-based molecular detector,
distinct biological probes are immobilized directly on the pixels of a CCD or can be
attached to a disposable cover slip placed on the CCD surface. The sample molecules can
be labeled with radioisotope, chemiluminescent or fluorescent tags.

Upon exposure of the sample to the CCD-based probe array, photons or
radioisotope decay products are emitted at the pixel locations where the sample has
bound, in the case of Format 3, to two complementary probes. In turn, electron-hole
pairs are generated in the silicon when the charged particles, or radiation from the labeled
sample, are incident on the CCD gates. Electrons are then collected beneath adjacent
CCD gates and sequentially read out on a display module. The number of photoelectrons
generated at each pixel is directly proportional to the number of molecular binding events
in such proximity. Consequently, molecular binding can be quantitatively determined
(Eggers et al., 1994).

By placing the imaging array in proximity to the sample, the collection efficiency
is improved by a factor of at least 10 over lens-based techniques such as those found in
conventional CCD cameras. That is, the sample (emitter) is in near contact with the
detector (imaging array), and this eliminates conventional imaging optics such as lenses
and mirrors.

When radioisotopes are attached as reporter groups to the target molecules,
energetic particles are detected. Several reporter groups that emit particles of varying
energies have been successfully utilized with the micro-fabricated detectors, including 32P,
33P, 35S, 14C and 125I. The higher energy particles, such as from 32P, provide the highest
molecular detection sensitivity, whereas the lower energy particles, such as from 35S,
provide better resolution. Hence the choice of the radioisotope reporter can be tailored as
required. Once the particular radioisotope label is selected, the detection performance can
be predicted by calculating the signal-to-noise ratio (SNR), as described by Eggers et al.
(1994).

An alternative luminescent detection procedure involves the use of fluorescent or
chemiluminescent reporter groups attached to the target molecules. The fluorescent labels
can be attached covalently or through interaction. Fluorescent dyes, such as ethidium
bromide, with intense absorption bands in the near UV (300-350 nm) range and principal
emission bands in the visible (500-650 nm) range, are most suited for the CCD devices
employed since the quantum efficiency is several orders of magnitude lower at the
excitation wavelength than at the fluorescent signal wavelength.

From the perspective of detecting luminescence, the polysilicon CCD gates have
the built-in capacity to filter away the contribution of incident light in the UV range, yet
are very sensitive to the visible luminescence generated by the fluorescent reporter
groups. Such inherently large discrimination against UV excitation enables large SNRs
(greater than 100) to be achieved by the CCDs as formulated in the incorporated paper by
Eggers et al. (1994).

For probe immobilization on the detector, hybridization matrices may be produced
on inexpensive SiO2 wafers, which are subsequently placed on the surface of the CCD
following hybridization and drying. This format is economically efficient since the
hybridization of the DNA is conducted on inexpensive disposable SiO2 wafers, thus
allowing reuse of the more expensive CCD detector. Alternatively, the probes can be
immobilized directly on the CCD to create a dedicated probe matrix.

To immobilize probes upon the SiO2 coating, a uniform epoxide layer is linked to
the film surface, employing an epoxy-silane reagent and standard SiO2 modification
chemistry. Amine-modified oligonucleotide probes are then linked to the SiO2 surface by
means of secondary amine formation with the epoxide ring. The resulting linkage
provides 17 rotatable bonds of separation between the 3' base of the oligonucleotide and
the SiO2 surface. To ensure complete amine deprotonation and to minimize secondary
structure formation during coupling, the reaction is performed in 0.1 M KOH and
incubated at 37°C for 6 hours.

In Format 3 SBH in general, signals are scored for each of a billion points. It
would not be necessary to hybridize all arrays, e.g., 4000 arrays of 5 x 5 mm, at a time,
and the successive use of a smaller number of arrays is possible.

Cycling hybridizations are one possible method for increasing the hybridization
signal. In one cycle, most of the fixed probes will hybridize with DNA fragments with
tail sequences non-complementary for labeled probes. By increasing the temperature,
those hybrids will be melted. In the next cycle, some of them (approximately 0.1%) will
hybridize with an appropriate DNA fragment and additional labeled probes will be ligated.
In this case, there occurs a discriminative melting of DNA hybrids with mismatches for both
probe sets simultaneously.

In the cycle hybridization, all components are added before the cycling starts, at
37°C for T4 ligase, or a higher temperature for a thermostable ligase. Then the temperature
is decreased to 15-37°C and the chip is incubated for up to 10 minutes, and then the
temperature is increased to 37°C or higher for a few minutes and then again reduced.
Cycles can be repeated up to 10 times. In one variant, an optimal higher temperature
(10-50°C) can be used without cycling and a longer ligation reaction can be performed (1-3
hours).

The procedure described herein allows complex chip manufacturing using standard
synthesis and precise spotting of oligonucleotides because a relatively small number of
oligonucleotides are necessary. For example, if all 7-mer oligos are synthesized (16384
probes), lists of 256 million 14-mers can be determined.
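
The probe counts quoted above follow from the four-letter alphabet, as the short
Python check below shows. The variable and function names are illustrative only, and
the pairing of one fixed 7-mer with one labeled 7-mer to read a 14-mer is the
interpretation assumed here.

```python
def count_kmers(k):
    """Number of distinct oligonucleotides of length k over A, C, G, T."""
    return 4 ** k

seven_mers = count_kmers(7)        # all 7-mer probes
# Each fixed 7-mer paired with each labeled 7-mer reads one 14-mer.
fourteen_mers = seven_mers ** 2
# "256 million" in the text corresponds to 256 x 2**20.
in_binary_millions = fourteen_mers // 2 ** 20
```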

One important variant of the invented method is to use more than one differently
labeled probe per base array. This can be executed with two purposes in mind:
multiplexing to reduce the number of separately hybridized arrays, or determining a list of
even longer oligosequences such as 3 x 6 or 3 x 7. In this case, if two labels are used,
the specificity of the 3 consecutive oligonucleotides can be almost absolute because
positive sites must have enough signals of both labels.

A further and additional variant is to use chips containing BxNy probes with y
being from 1 to 4. Those chips allow sequence reading in different frames. This can
also be achieved by using appropriate sets of labeled probes, or both F and P probes could
have some unspecified end positions (i.e., some element of terminal degeneracy).
Universal bases may also be employed as part of a linker to join the probes of defined
sequence to the solid support. This makes the probe more available to hybridization and
makes the construct more stable. If a probe has 5 bases, one may, e.g., use 3 universal
bases as a linker.

EXAMPLE 16

Determining Sequence from Hybridization Data 

Sequence assembly may be interrupted wherever a given overlapping (N-1)mer
is duplicated two or more times. Then either of the two N-mers differing in the last
nucleotide may be used in extending the sequence. This branching point limits
unambiguous assembly of sequence.

Reassembling the sequence of known oligonucleotides that hybridize to the target
nucleic acid to generate the complete sequence of the target nucleic acid may not be
accomplished in some cases. This is because some information may be lost if the target
nucleic acid is not in fragments of appropriate size in relation to the size of
oligonucleotide that is used for hybridizing. The quantity of information lost is
proportional to the length of a target being sequenced. However, if sufficiently short
targets are used, their sequence may be unambiguously determined.

The probable frequency of duplicated sequences that would interfere with sequence
assembly which is distributed along a certain length of DNA may be calculated. This
derivation requires the introduction of the definition of a parameter having to do with
sequence organization: the sequence subfragment (SF). A sequence subfragment results
if any part of the sequence of a target nucleic acid starts and ends with an (N-1)mer that
is repeated two or more times within the target sequence. Thus, subfragments are
sequences generated between two points of branching in the process of assembly of the
sequences in the method of the invention. The sum of all subfragments is longer than the
actual target nucleic acid because of overlapping short ends. Generally, subfragments
may not be assembled in a linear order without additional information since they have
shared (N-1)mers at their ends and starts. Different numbers of subfragments are
obtained for each nucleic acid target depending on the number of its repeated (N-1)mers.
The number depends on the value of N-1 and the length of the target.
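
The definition of a sequence subfragment may be illustrated by the following Python
sketch, which finds the repeated (N-1)mers of a known sequence and cuts the sequence
at them. It is an illustration under stated assumptions and not the program referred to
elsewhere in this specification: the function names and the example target are
hypothetical, and the two ends of the target are treated as additional boundaries.

```python
from collections import Counter

def repeated_words(target, n):
    """Return the (N-1)mers that occur two or more times in target."""
    w = n - 1
    counts = Counter(target[i:i + w] for i in range(len(target) - w + 1))
    return {word for word, count in counts.items() if count >= 2}

def subfragments(target, n):
    """Split target at every occurrence of a repeated (N-1)mer.

    Each piece keeps the repeated word at its boundary, so neighbouring
    pieces overlap by N-1 bases.
    """
    w = n - 1
    repeats = repeated_words(target, n)
    cuts = [i for i in range(len(target) - w + 1) if target[i:i + w] in repeats]
    starts = [0] + cuts
    ends = [c + w for c in cuts] + [len(target)]
    # Skip degenerate pieces that consist of a boundary word only.
    return [target[s:e] for s, e in zip(starts, ends) if e - s > w]

target = "TTACGGCACGAA"          # hypothetical 12-base target
pieces = subfragments(target, 4)  # N = 4, so overlaps are 3-mers
total_length = sum(len(p) for p in pieces)
```

Here the 3-mer ACG occurs twice, giving three subfragments whose combined length (18)
exceeds the 12-base target because of the shared ends, as stated above.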

Probability calculations can estimate the interrelationship of the two factors. If the
ordering of positive N-mers is accomplished by using overlapping sequences of length
N-1 or at an average distance of A_0, the number of subfragments (N_sf) of a fragment
L_f bases long is given by equation one:

$$N_{sf} = 1 + A_0 \times K \times P(K, L_f)$$

where K is greater than or equal to 2, and P(K, L_f) represents the probability of an
N-mer occurring K times on a fragment L_f bases long. Also, a computer program that is
able to form subfragments from the content of N-mers for any given sequence is
described below in Example 18.

The number of subfragments increases with the increase of lengths of fragments
for a given length of probe. Obtained subfragments may not be uniquely ordered among
themselves. Although not complete, this information is very useful for comparative
sequence analysis and the recognition of functional sequence characteristics. This type of
information may be called partial sequence. Another way of obtaining partial sequence is
the use of only a subset of oligonucleotide probes of a given length.

There may be relatively good agreement between predicted sequence according to
theory and a computer simulation for a random DNA sequence. For instance, for N-1 =
7, [using an 8-mer or groups of sixteen 10-mers of type 5' (A,T,C,G) B8 (A,T,C,G) 3'] a
target nucleic acid of 200 bases will have an average of three subfragments. However,
because of the dispersion around the mean, a library of target nucleic acid should have
inserts of 500 bp so that less than 1 in 2000 targets have more than three subfragments.
Thus, in an ideal case of sequence determination of a long nucleic acid of random
sequence, a representative library with sufficiently short inserts of target nucleic acid may
be used. For such inserts, it is possible to reconstruct the individual target by the method
of the invention. The entire sequence of a large nucleic acid is then obtained by
overlapping of the defined individual insert sequences.

To reduce the need for very short fragments, e.g., 50 bases for 8-mer probes, the
information contained in the overlapped fragments present in every random DNA
fragmentation process, like cloning or random PCR, is used. It is also possible to use
pools of short physical nucleic acid fragments. Using 8-mers or 11-mers like 5' (A, T,
C, G) N8 (A, T, C, G) 3' for sequencing 1 megabase, instead of needing 20,000 50 bp
fragments, only 2,100 samples are sufficient. This number consists of 700 random 7 kb
clones (basic library), 1250 pools of 20 clones of 500 bp (subfragments ordering library)
and 150 clones from a jumping (or similar) library. The developed algorithm (see Example
18) regenerates sequence using hybridization data of these described samples.
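
The sample counts in the preceding paragraph can be checked with simple arithmetic.
The Python lines below are only a bookkeeping sketch with illustrative variable names;
the redundancy figure is derived here and is not stated in the text.

```python
target_length_bp = 1_000_000

# Direct approach: one sample per 50 bp fragment.
short_fragments_needed = target_length_bp // 50

# Library-based approach described above.
basic_library_clones = 700     # random 7 kb clones
ordering_pools = 1250          # pools of 20 clones of 500 bp
jumping_clones = 150           # clones from a jumping (or similar) library
total_samples = basic_library_clones + ordering_pools + jumping_clones

# Derived quantity: average coverage of the megabase by the basic library.
basic_library_coverage = basic_library_clones * 7_000 / target_length_bp
```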

EXAMPLE 17 
Algorithm

This example describes an algorithm for generation of a long sequence written in a
four letter alphabet from constituent k-tuple words in a minimal number of separate,
randomly defined fragments of a starting nucleic acid sequence where K is the length of
an oligonucleotide probe. The algorithm is primarily intended for use in the sequencing
by hybridization (SBH) process. The algorithm is based on subfragments (SF),
informative fragments (IF) and the possibility of using pools of physical nucleic sequences
for defining informative fragments.

As described, subfragments may be caused by branch points in the assembly
process resulting from the repetition of a K-1 oligomer sequence in a target nucleic acid.
Subfragments are sequence fragments found between any two repetitive words of the
length K-1 that occur in a sequence. Multiple occurrences of K-1 words are the cause of
interruption of ordering the overlap of K-words in the process of sequence generation.
Interruption leads to a sequence remaining in the form of subfragments. Thus, the
unambiguous segments between branching points whose order is not uniquely determined
are called sequence subfragments.
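
As an illustration of this definition only, the following sketch splits a known sequence at every (K-1)-word that occurs more than once. It assumes the full sequence is available, which is not the situation in SBH, where only the k-tuple content is known; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

def split_into_subfragments(sequence, k):
    """Split `sequence` at every (k-1)-word that occurs more than once.

    Neighbouring subfragments share the repeated (k-1)-word at their ends.
    """
    w = k - 1
    words = [sequence[i:i + w] for i in range(len(sequence) - w + 1)]
    counts = Counter(words)
    branch_positions = [i for i, word in enumerate(words) if counts[word] > 1]

    subfragments = []
    start = 0
    for pos in branch_positions:
        piece = sequence[start:pos + w]
        if len(piece) > w:  # skip pieces that are only the repeated word
            subfragments.append(piece)
        start = pos
    tail = sequence[start:]
    if len(tail) > w:
        subfragments.append(tail)
    return subfragments

# Toy sequence in which the 3-mer "CGT" occurs twice (k = 4).
example = split_into_subfragments("GACGTTAGTCGTAC", 4)
print(example)
```

With the toy input, the single repeated 3-mer yields three subfragments, each bounded by the repeated word or by an end of the sequence.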

Informative fragments are defined as fragments of a sequence that are determined 
by the nearest ends of overlapped physical sequence fragments. 
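
One way to read this definition: if the start and end coordinates of the overlapping physical fragments are sorted along the sequence, each interval between two neighbouring coordinates that is covered by at least one fragment is an informative fragment. The sketch below assumes the coordinates are known (as they are in a simulation) and uses half-open intervals; the function name is hypothetical.

```python
def informative_fragments(clones):
    """Return intervals bounded by the nearest clone ends.

    `clones` is a list of (start, end) half-open coordinate pairs.
    """
    boundaries = sorted({position for clone in clones for position in clone})
    fragments = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep only intervals covered by at least one physical fragment.
        if any(start <= left and right <= end for start, end in clones):
            fragments.append((left, right))
    return fragments

clones = [(0, 5000), (3000, 8000), (6000, 11000)]
fragments = informative_fragments(clones)
print(fragments)
```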

A certain number of physical fragments may be pooled without losing the
possibility of defining informative fragments. The total length of randomly pooled
fragments depends on the length of k-tuples that are used in the sequencing process.

The algorithm consists of two main units. The first part is used for generation of
subfragments from the set of k-tuples contained in a sequence. Subfragments may be
generated within the coding region of physical nucleic acid sequence of certain sizes, or
within the informative fragments defined within long nucleic acid sequences. Both types
of fragments are members of the basic library. This algorithm does not describe the
determination of the content of the k-tuples of the informative fragments of the basic
library, i.e. the step of preparation of informative fragments to be used in the sequence
generation process.

The second part of the algorithm determines the linear order of obtained
subfragments with the purpose of regenerating the complete sequence of the nucleic acid
fragments of the basic library. For this purpose a second, ordering library is used, made
of randomly pooled fragments of the starting sequence. The algorithm does not include
the step of combining sequences of basic fragments to regenerate an entire, megabase plus
sequence. This may be accomplished using the link-up of fragments of the basic library
which is a prerequisite for informative fragment generation. Alternatively, it may be
accomplished after generation of sequences of fragments of the basic library by this
algorithm, using search for their overlap, based on the presence of common
end-sequences.

The algorithm requires neither knowledge of the number of appearances of a given
k-tuple in a nucleic acid sequence of the basic and ordering libraries, nor does it require
the information of which k-tuple words are present on the ends of a fragment. The
algorithm operates with the mixed content of k-tuples of various length. The concept of
the algorithm enables operations with the k-tuple sets that contain false positive and false
negative k-tuples. Only in specific cases does the content of the false k-tuples primarily
influence the completeness and correctness of the generated sequence. The algorithm
may be used for optimization of parameters in simulation experiments, as well as for
sequence generation in the actual SBH experiments, e.g. generation of the genomic DNA
sequence. In optimization of parameters, the choice of the oligonucleotide probes
(k-tuples) for practical and convenient fragments and/or the choice of the optimal lengths
and the number of fragments for the defined probes are especially important.

The first part of the algorithm, the generation of subfragments, has a central role in the
process of the generation of the sequence from the content of k-tuples. It is based on the
unique ordering of k-tuples by means of maximal overlap. The main obstacles in sequence
generation are specific repeated sequences and false positive and/or negative k-tuples.
The aim of this part of the algorithm is to obtain the minimal number of the longest
possible subfragments, with correct sequence. This part of the algorithm consists of one
basic, and several control steps. A two-stage process is necessary since certain
information can be used only after generation of all primary subfragments (pSFs).

The main problem of sequence generation is obtaining a repeated sequence from
word contents that by definition do not carry information on the number of occurrences of
the particular k-tuples. The concept of the entire algorithm depends on the basis on
which this problem is solved. In principle, there are two opposite approaches: 1)
repeated sequences may be obtained at the beginning, in the process of generation of
pSFs, or 2) repeated sequences can be obtained later, in the process of the final ordering
of the subfragments. In the first case, pSFs contain an excess of sequences and in the
second case, they contain a deficit of sequences. The first approach requires elimination
of the excess sequences generated, and the second requires permitting multiple use of
some of the subfragments in the process of the final assembling of the sequence.

The difference between the two approaches is in the degree of strictness of the rule of
unique overlap of k-tuples. The less severe rule is: k-tuple X is unambiguously
maximally overlapped with k-tuple Y if and only if the rightmost k-1 end of k-tuple X is
present only on the leftmost end of k-tuple Y. This rule allows the generation of
repetitive sequences and the formation of surplus sequences.

A stricter rule which is used in the second approach has an additional caveat:
k-tuple X is unambiguously maximally overlapped with k-tuple Y if and only if the
rightmost k-1 end of k-tuple X is present only on the leftmost end of k-tuple Y and if
the leftmost k-1 end of k-tuple Y is not present on the rightmost end of any other
k-tuple. The algorithm based on the stricter rule is simpler, and is described herein.
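
The two rules can be written as a small predicate. The sketch below is a minimal illustration that assumes all k-tuples have the same length and that X itself is a member of the set; the function name and the toy set are hypothetical.

```python
def unique_successor(x, ktuples, strict=True):
    """Return the k-tuple that unambiguously follows `x`, or None.

    Less severe rule: the right (k-1) end of `x` is the left end of exactly
    one k-tuple. Strict rule: in addition, no other k-tuple has that same
    (k-1) word on its right end.
    """
    w = len(x) - 1
    overlap = x[-w:]
    followers = [y for y in ktuples if y[:w] == overlap]
    if len(followers) != 1:
        return None
    if strict:
        predecessors = [z for z in ktuples if z[-w:] == overlap]
        if len(predecessors) != 1:  # some k-tuple other than x ends in it
            return None
    return followers[0]

# 4-tuples of the toy sequence GACGTTAGTCGT, which ends in the repeated "CGT".
ktuples = ["GACG", "ACGT", "CGTT", "GTTA", "TTAG",
           "TAGT", "AGTC", "GTCG", "TCGT"]
print(unique_successor("ACGT", ktuples, strict=False))
print(unique_successor("ACGT", ktuples, strict=True))
```

In the toy set, "ACGT" has a single follower ("CGTT") under the less severe rule, but the strict rule rejects the overlap because "TCGT" also ends in "CGT".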

The process of elongation of a given subfragment is stopped when the right k-1
end of the last k-tuple included is not present on the left end of any k-tuple or is present
on two or more k-tuples. If it is present on only one k-tuple the second part of the rule is
tested. If in addition there is a k-tuple which differs from the previously included one,
the assembly of the given subfragment is terminated only on the first leftmost position. If
this additional k-tuple does not exist, the conditions are met for unique k-1 overlap and a
given subfragment is extended to the right by one element.

Beside the basic rule, a supplementary one is used to allow the usage of k-tuples
of different lengths. The maximal overlap is the length of k-1 of the shorter k-tuple of
the overlapping pair. Generation of the pSFs is performed starting from the first k-tuple
from the file in which k-tuples are displayed randomly and independently from their order
in a nucleic acid sequence. Thus, the first k-tuple in the file is not necessarily on the
beginning of the sequence, nor on the start of the particular subfragment. The process of
subfragment generation is performed by ordering the k-tuples by means of unique
overlap, which is defined by the described rule. Each used k-tuple is erased from the
file. At the point when there are no further k-tuples unambiguously overlapping with the
last one included, the building of the subfragment is terminated and the buildup of another
pSF is started. Since generation of a majority of subfragments does not begin from their
actual starts, the formed pSFs are added to the k-tuple file and are considered as longer
k-tuples. Another possibility is to form subfragments going in both directions from the
starting k-tuple. The process ends when further overlap, i.e. the extension of any of the
subfragments, is not possible.
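
A minimal sketch of pSF generation under the strict rule follows. It is not the file-based procedure described above: instead of erasing k-tuples and re-adding pSFs as longer k-tuples, it links every uniquely overlapping pair first and then reads off the chains, which corresponds to extending in both directions. It assumes k-tuples of a single length and an error-free set; all names are hypothetical.

```python
from collections import defaultdict

def ktuple_set(sequence, k):
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def primary_subfragments(ktuples):
    """Chain k-tuples that overlap uniquely (strict rule) into pSFs."""
    ktuples = set(ktuples)
    w = len(next(iter(ktuples))) - 1
    by_prefix, by_suffix = defaultdict(list), defaultdict(list)
    for t in ktuples:
        by_prefix[t[:w]].append(t)
        by_suffix[t[-w:]].append(t)

    # Link X -> Y only when the shared (k-1)-word has exactly one
    # k-tuple ending in it and exactly one k-tuple starting with it.
    successor = {}
    for word, followers in by_prefix.items():
        predecessors = by_suffix.get(word, [])
        if (len(followers) == 1 and len(predecessors) == 1
                and predecessors[0] != followers[0]):
            successor[predecessors[0]] = followers[0]

    has_predecessor = set(successor.values())
    starts = [t for t in sorted(ktuples) if t not in has_predecessor]
    starts += sorted(ktuples)  # anything left over lies on a circular chain
    used, psfs = set(), []
    for start in starts:
        if start in used:
            continue
        sequence, current = start, start
        used.add(current)
        while current in successor and successor[current] not in used:
            current = successor[current]
            used.add(current)
            sequence += current[-1]
        psfs.append(sequence)
    return sorted(psfs)

psfs = primary_subfragments(ktuple_set("GACGTTAGTCGTAC", 4))
print(psfs)
```

For the toy sequence, elongation stops on both sides of the repeated 3-mer "CGT", so three pSFs are produced.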

The pSFs can be divided in three groups: 1) subfragments of the maximal length
and correct sequence in cases of exact k-tuple set; 2) short subfragments, formed due to
the use of the maximal and unambiguous overlap rule on the incomplete set, and/or the
set with some false positive k-tuples; and 3) pSFs of an incorrect sequence. The
incompleteness of the set in 2) is caused by false negative results of a hybridization
experiment, as well as by using an incorrect set of k-tuples. The pSFs in 3) are formed
due to the false positive and false negative k-tuples and can be: a) misconnected
subfragments; b) subfragments with the wrong end; and c) false positive k-tuples which
appear as false minimal subfragments.

Considering false positive k-tuples, there is the possibility for the presence of a
k-tuple containing more than one wrong base or containing one wrong base somewhere in
the middle, as well as the possibility for a k-tuple with a wrong base on the end.
Generation of short, erroneous or misconnected subfragments is caused by the latter
k-tuples. The k-tuples of the former two kinds represent wrong pSFs with length equal to
k-tuple length.

In the case of one false negative k-tuple, pSFs are generated because of the
impossibility of maximal overlapping. In the case of the presence of one false positive
k-tuple with the wrong base on its leftmost or rightmost end, pSFs are generated because
of the impossibility of unambiguous overlapping. When both false positive and false
negative k-tuples with a common k-1 sequence are present in the file, pSFs are generated,
and one of these pSFs contains the wrong k-tuple at the relevant end.

The process of correcting subfragments with errors in sequence and the linking of
unambiguously connected pSFs is performed after subfragment generation and in the
process of subfragment ordering. The first step, which consists of cutting the
misconnected pSFs and obtaining the final subfragments by unambiguous connection of
pSFs, is described below.

There are two approaches for the formation of misconnected subfragments. In the
first a mistake occurs when an erroneous k-tuple appears on the points of assembly of the
repeated sequences of lengths k-1. In the second, the repeated sequences are shorter than
k-1. These situations can occur in two variants each. In the first variant, one of the
repeated sequences represents the end of a fragment. In the second variant, the repeated
sequence occurs at any position within the fragment. For the first possibility, the
absence of some k-tuples from the file (false negatives) is required to generate a
misconnection. The second possibility requires the presence of both false negative and
false positive k-tuples in the file. Considering the repetitions of k-1 sequence, the lack of
only one k-tuple is sufficient when either end is repeated internally. The lack of two is
needed for strictly internal repetition. The reason is that the end of a sequence can be
considered informatically as an endless linear array of false negative k-tuples. From the
"smaller than k-1 case", only the repeated sequence of the length of k-2, which requires
two or three specific erroneous k-tuples, will be considered. It is very likely that these
will be the only cases which will be detected in a real experiment, the others being much
less frequent.

Recognition of the misconnected subfragments is more strictly defined when a
repeated sequence does not appear at the end of the fragment. In this situation, one can
detect further two subfragments, one of which contains on its leftmost, and the other on
its rightmost end k-2 sequences which are also present in the misconnected subfragment.
When the repeated sequence is on the end of the fragment, there is only one subfragment
which contains k-2 sequence causing the mistake in subfragment formation on its leftmost
or rightmost end.

The removal of misconnected subfragments by their cutting is performed according
to the common rule: If the leftmost or rightmost sequence of the length of k-2 of any
subfragment is present in any other subfragment, the subfragment is to be cut into two
subfragments, each of them containing k-2 sequence. This rule does not cover rarer
situations of a repeated end when there is more than one false negative k-tuple on the
point of repeated k-1 sequence. Misconnected subfragments of this kind can be
recognized by using the information from the overlapped fragments, or informative
fragments of both the basic and ordering libraries. In addition, the misconnected
subfragment will remain when two or more false negative k-tuples occur on both positions
which contain the identical k-1 sequence. This is a very rare situation since it requires at
least 4 specific false k-tuples. An additional rule can be introduced to cut these
subfragments on sequences of length k if the given sequence can be obtained by
combination of sequences shorter than k-2 from the end of one subfragment and the start
of another.
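
The common cutting rule can be sketched as follows. This is one possible reading, stated as an assumption: a subfragment is cut wherever the leftmost or rightmost (k-2)-sequence of another subfragment occurs strictly inside it, and both resulting pieces keep the (k-2)-sequence. The function name and the toy subfragments are hypothetical.

```python
def cut_misconnected(subfragments, k):
    """Cut subfragments at internal copies of other subfragments' k-2 ends."""
    w = k - 2
    result = []
    for i, subfragment in enumerate(subfragments):
        other_ends = set()
        for j, other in enumerate(subfragments):
            if j != i:
                other_ends.update((other[:w], other[-w:]))
        start = 0
        # Only strictly internal positions are tested, so an ordinary
        # end-to-end overlap does not trigger a cut.
        for pos in range(1, len(subfragment) - w):
            if subfragment[pos:pos + w] in other_ends:
                result.append(subfragment[start:pos + w])
                start = pos
        result.append(subfragment[start:])
    return result

# k = 8, so ends of length 6 are compared. The first subfragment contains
# "GGATCC" internally; the other two carry it on one of their ends.
pieces = cut_misconnected(["TTACGGATCCAGTTA", "CATGGGATCC", "GGATCCTTGA"], 8)
print(pieces)
```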

By strict application of the described rule, some completeness is lost to ensure the
accuracy of the output. Some of the subfragments will be cut although they are not
misconnected since they fit into the pattern of a misconnected subfragment. There are
several situations of this kind. For example, a fragment, beside at least two identical k-1
sequences, contains any k-2 sequence from k-1 or a fragment contains k-2 sequence
repeated at least twice and at least one false negative k-tuple containing given k-2
sequence in the middle, etc.

The aim of this part of the algorithm is to reduce the number of pSFs to a minimal
number of longer subfragments with correct sequence. The generation of unique longer
subfragments or a complete sequence is possible in two situations. The first situation
concerns the specific order of repeated k-1 words. There are cases in which some or all
maximally extended pSFs (the first group of pSFs) can be uniquely ordered. For
example, in fragment S-R1-a-R2-b-R1-c-R2-E where S and E are the start and end of a
fragment, a, b, and c are different sequences specific to respective subfragments and R1
and R2 are two k-1 sequences that are tandemly repeated, five subfragments are generated
(S-R1, R1-a-R2, R2-b-R1, R1-c-R2, and R2-E). They may be ordered in two ways: the
original sequence above or S-R1-c-R2-b-R1-a-R2-E. In contrast, in a fragment with the
same number and types of repeated sequences but ordered differently, i.e.
S-R1-a-R1-b-R2-c-R2-E, there is no other sequence which includes all subfragments.
Examples of this type can be recognized only after the process of generation of pSFs.
They represent the necessity for two steps in the process of pSF generation. The second
situation, the generation of false short subfragments on positions of nonrepeated k-1
sequences when the files contain false negative and/or positive k-tuples, is more
important.
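
The two repeat arrangements in the example above can be checked by brute force. The sketch below represents each subfragment only by its end labels (S, R1, R2, E) and its specific middle part, and enumerates every order that uses each subfragment exactly once. The representation and function names are illustrative assumptions, not part of the described algorithm.

```python
def orderings(subfragments, start="S", end="E"):
    """List every start-to-end order that uses each subfragment once.

    Each subfragment is a (left_end, middle, right_end) tuple of labels.
    """
    results = []

    def extend(path, remaining):
        if not remaining:
            if path[-1][2] == end:
                results.append(path)
            return
        for i, candidate in enumerate(remaining):
            if candidate[0] == path[-1][2]:
                extend(path + [candidate], remaining[:i] + remaining[i + 1:])

    for i, candidate in enumerate(subfragments):
        if candidate[0] == start:
            extend([candidate], subfragments[:i] + subfragments[i + 1:])
    return results

def render(path):
    parts = [path[0][0]]
    for _left, middle, right in path:
        if middle:
            parts.append(middle)
        parts.append(right)
    return "-".join(parts)

ambiguous = [("S", "", "R1"), ("R1", "a", "R2"), ("R2", "b", "R1"),
             ("R1", "c", "R2"), ("R2", "", "E")]
unique = [("S", "", "R1"), ("R1", "a", "R1"), ("R1", "b", "R2"),
          ("R2", "c", "R2"), ("R2", "", "E")]

ambiguous_orders = sorted(render(p) for p in orderings(ambiguous))
unique_orders = sorted(render(p) for p in orderings(unique))
print(ambiguous_orders)
print(unique_orders)
```

The first arrangement admits two complete orders and the second only one, in agreement with the text.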

The solution for both pSF groups consists of two parts. First, the false positive
k-tuples appearing as the nonexisting minimal subfragments are eliminated. All k-tuple
subfragments of length k which do not have an overlap on either end, of the length of
longer than k-a on one end and longer than k-b on the other end, are eliminated to enable
formation of the maximal number of connections. In our experiments, the values for a
and b of 2 and 3, respectively, appeared to be adequate to eliminate a sufficient number
of false positive k-tuples.

The merging of subfragments that can be uniquely connected is accomplished in
the second step. The rule for connection is: two subfragments may be unambiguously
connected if, and only if, the overlapping sequence at the relevant end or start of two
subfragments is not present at the start and/or end of any other subfragment.

The exception is if one subfragment from the considered pair has the identical
beginning and end. In that case connection is permitted, even if there is another
subfragment with the same end present in the file. The main problem here is the precise
definition of overlapping sequence. The connection is not permitted if the overlapping
sequence unique for only one pair of subfragments is shorter than k-2, or it is k-2 or
longer but an additional subfragment exists with the overlapping sequence of any length
longer than k-4. Also, both the canonical ends of pSFs and the ends after omitting one
(or few) last bases are considered as the overlapping sequences.

After this step some false positive k-tuples (as minimal subfragments) and some
subfragments with a wrong end may survive. In addition, in very rare occasions where a
certain number of some specific false k-tuples are simultaneously present, an erroneous
connection may take place. These cases will be detected and solved in the subfragment
ordering process, and in the additional control steps along with the handling of uncut
"misconnected" subfragments.

The short subfragments that are obtained are of two kinds. In the common case,
these subfragments may be unambiguously connected among themselves because of the
distribution of repeated k-1 sequences. This may be done after the process of generation
of pSFs and is a good example of the necessity for two steps in the process of pSF
generation. In the case of using the file containing false positive and/or false negative
k-tuples, short pSFs are obtained on the sites of nonrepeated k-1 sequences. Considering
false positive k-tuples, a k-tuple may contain more than one wrong base (or one wrong
base somewhere in the middle), or a wrong base on the end. Generation of
short and erroneous (or misconnected) subfragments is caused by the latter k-tuples. The
k-tuples of the former kind represent wrong pSFs with length equal to k-tuple length.

The aim of the pSF merging part of the algorithm is the reduction of the number of
pSFs to the minimal number of longer subfragments with the correct sequence. All
k-tuple subfragments that do not have an overlap on either end, of the length of longer
than k-a on one, and longer than k-b on the other end, are eliminated to enable the
maximal number of connections. In this way, the majority of false positive k-tuples are
discarded. The rule for connection is: two subfragments can be unambiguously
connected if, and only if, the overlapping sequence of the relevant end or start of two
subfragments is not present on the start and/or end of any other subfragment. The
exception is a subfragment with the identical beginning and end. In that case connection
is permitted, even if there is another subfragment with the same end present in the
file. The main problem here is the precise definition of overlapping sequence. The
presence of at least two specific false negative k-tuples on the points of repetition of k-1
or k-2 sequences, as well as combining of the false positive and false negative k-tuples,
may destroy or "mask" some overlapping sequences and can produce an unambiguous,
but wrong connection of pSFs. To prevent this, completeness must be sacrificed on
account of exactness: the connection is not permitted on the end-sequences shorter than
k-2, and in the presence of an extra overlapping sequence longer than k-4. The
overlapping sequences are defined from the end of the pSFs, or omitting one, or few last
bases.

In the very rare situations, with the presence of a certain number of some specific
false positive and false negative k-tuples, some subfragments with the wrong end can
survive, some false positive k-tuples (as minimal subfragments) can remain, or the
erroneous connection can take place. These cases are detected and solved in the
subfragments ordering process, and in the additional control steps along with the handling
of uncut, misconnected subfragments.

The process of ordering of subfragments is similar to the process of their
generation. If one considers subfragments as longer k-tuples, ordering is performed by
their unambiguous connection via overlapping ends. The informational basis for
unambiguous connection is the division of subfragments generated in fragments of the
basic library into groups representing segments of those fragments. The method is
analogous to the biochemical solution of this problem based on hybridization with longer
oligonucleotides with relevant connecting sequence. The connecting sequences are
generated as subfragments using the k-tuple sets of the appropriate segments of basic
library fragments. Relevant segments are defined by the fragments of the ordering library
that overlap with the respective fragments of the basic library. The shortest segments are
informative fragments of the ordering library. The longer ones are several neighboring
informative fragments or total overlapping portions of corresponding fragments of the
ordering and basic libraries. In order to decrease the number of separate samples,
fragments of the ordering library are randomly pooled, and the unique k-tuple content is
determined.

By using the large number of fragments in the ordering library very short
segments are generated, thus reducing the chance of the multiple appearance of the k-1
sequences which are the reasons for generation of the subfragments. Furthermore, longer
segments, consisting of the various regions of the given fragment of the basic library, do
not contain some of the repeated k-1 sequences. In every segment a connecting sequence
(a connecting subfragment) is generated for a certain pair of the subfragments from the
given fragment. The process of ordering consists of three steps: (1) generation of the
k-tuple contents of each segment; (2) generation of subfragments in each segment; and (3)
connection of the subfragments of the segments. Primary segments are defined as
significant intersections and differences of k-tuple contents of a given fragment of the
basic library with the k-tuple contents of the pools of the ordering library. Secondary
(shorter) segments are defined as intersections and differences of the k-tuple contents of
the primary segments.
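
The set operations behind primary segments can be sketched with ordinary Python sets. The example assumes error-free k-tuple contents and treats "significant" as a simple minimum count (the parameter `min_common` is an assumption; the text does not give a threshold). Function names and the toy sequence are hypothetical.

```python
def ktuple_set(sequence, k):
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def primary_segments(basic_ktuples, pool_ktuples, min_common):
    """Return (intersection, difference) for one basic fragment and one pool.

    Returns None when the intersection is too small to be significant.
    """
    intersection = basic_ktuples & pool_ktuples
    if len(intersection) < min_common:
        return None
    difference = basic_ktuples - intersection
    return intersection, difference

k = 4
genome = "GACGTTAGTCGTACCATGGATTCA"
basic = ktuple_set(genome[0:16], k)   # a basic library fragment
pool = ktuple_set(genome[8:24], k)    # a pool holding one overlapping fragment
intersection, difference = primary_segments(basic, pool, min_common=3)
print(sorted(intersection))
print(sorted(difference))
```

Secondary segments would be obtained by applying the same two operations to pairs of primary segments.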

There is a problem of accumulating both false positive and negative k-tuples in
both the differences and intersections. The false negative k-tuples from starting sequences
accumulate in the intersections (overlapping parts), as well as false positive k-tuples
occurring randomly in both sequences, but not in the relevant overlapping region. On the
other hand, the majority of false positives from either of the starting sequences is not
taken up into intersections. This is an example of the reduction of experimental errors
from individual fragments by using information from fragments overlapping with them.
The false k-tuples accumulate in the differences for another reason. The set of false
negatives from the original sequences is enlarged by the false positives from intersections,
and the set of false positives by those k-tuples which are not included in the intersection
by error, i.e. are false negative in the intersection. If the starting sequences contain 10%
false negative data, the primary and secondary intersections will contain 19% and 28%
false negative k-tuples, respectively. On the other hand, a mathematical expectation of
77 false positives may be predicted if the basic fragment and the pools have lengths of
500 bp and 10,000 bp, respectively. However, there is a possibility of recovering most
of the "lost" k-tuples and of eliminating most of the false positive k-tuples.

First, one has to determine a basic content of the k-tuples for a given segment as
the intersection of a given pair of the k-tuple contents. This is followed by including all
k-tuples of the starting k-tuple contents in the intersection, which contain at one end k-1
and at the other end k-+ sequences which occur at the ends of two k-tuples of the basic
set. This is done before generation of the differences thus preventing the accumulation of
false positives in that process. Following that, the same type of enlargement of k-tuple
set is applied to differences with the distinction that the borrowing is from the
intersections. All borrowed k-tuples are eliminated from the intersection files as false
positives.

The intersection, i.e. a set of common k-tuples, is defined for each pair (a basic
fragment) X (a pool of ordering library). If the number of k-tuples in the set is
significant it is enlarged with the false negatives according to the described rule. The
primary difference set is obtained by subtracting from a given basic fragment the obtained
intersection set. The false negative k-tuples are appended to the difference set by
borrowing from the intersection set according to the described rule and, at the same time,
removed from the intersection set as false positive k-tuples. When the basic fragment is
longer than the pooled fragments, this difference can represent the two separate segments
which somewhat reduces its utility in further steps. The primary segments are all
generated intersections and differences of pairs (a basic fragment) X (a pool of ordering
library) containing the significant number of k-tuples. K-tuple sets of secondary segments
are obtained by comparison of k-tuple sets of all possible pairs of primary segments. The
two differences are defined from each pair which produces the intersection with the
significant number of k-tuples. The majority of available information from overlapped
fragments is recovered in this step so that there is little to be gained from the third round
of forming intersections and differences.

(2) Generation of the subfragments of the segments is performed identically as
described for the fragments of the basic library.

(3) The method of connection of subfragments consists of sequentially
determining the correctly linked pairs of subfragments among the subfragments from a
given basic library fragment which have some overlapped ends. In the case of 4 relevant
subfragments, two of which contain the same beginning and two having the same end,
there are 4 different pairs of subfragments that can be connected. In general 2 are correct
and 2 are wrong. To find correct ones, the presence of the connecting sequences of each
pair is tested in the subfragments generated from all primary and secondary segments for
a given basic fragment. The length and the position of the connecting sequence are
chosen to avoid interference with sequences which occur by chance. They are k+2 or
longer, and include at least one element 2 beside overlapping sequence in both
subfragments of a given pair. The connection is permitted only if the two connecting
sequences are found and the remaining two do not exist. The two linked subfragments
replace former subfragments in the file and the process is cyclically repeated.
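
The test for the four-subfragment case can be sketched as below. This is a simplified illustration under stated assumptions: the overlap length and the number of flanking bases taken from each subfragment (`flank`) are passed in explicitly, and a connecting sequence counts as present if it occurs as a substring of any subfragment generated in the segments. All names and the toy data are hypothetical.

```python
def connecting_sequence(left, right, overlap_len, flank):
    """Overlap plus `flank` bases from each of the two subfragments."""
    return left[-(overlap_len + flank):] + right[overlap_len:overlap_len + flank]

def resolve_pairs(lefts, rights, overlap_len, segment_subfragments, flank=2):
    """Decide which of the four possible connections are supported.

    `lefts` are two subfragments with the same end and `rights` two with the
    same beginning. A pairing is accepted only if its two connecting
    sequences are found and the other two are not.
    """
    found = {}
    for i, left in enumerate(lefts):
        for j, right in enumerate(rights):
            probe = connecting_sequence(left, right, overlap_len, flank)
            found[(i, j)] = any(probe in s for s in segment_subfragments)
    straight = [(0, 0), (1, 1)]
    crossed = [(0, 1), (1, 0)]
    if all(found[p] for p in straight) and not any(found[p] for p in crossed):
        return straight
    if all(found[p] for p in crossed) and not any(found[p] for p in straight):
        return crossed
    return None

# Toy data (k = 4, overlap "CGT") from the sequence GACGTTAGTCGTAC.
lefts = ["GACGT", "TAGTCGT"]
rights = ["CGTTAGT", "CGTAC"]
segment_subfragments = ["GACGTTAG", "AGTCGTAC"]
links = resolve_pairs(lefts, rights, 3, segment_subfragments)
print(links)
```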

Repeated sequences are generated in this step. This means that some subfragments
are included in linked subfragments more than once. They will be recognized by finding
the relevant connecting sequence which engages one subfragment in connection with two
different subfragments.

The recognition of misconnected subfragments generated in the processes of
building pSFs and merging pSFs into longer subfragments is based on testing whether the
sequences of subfragments from a given basic fragment exist in the sequences of
subfragments generated in the segments for the fragment. The sequences from an
incorrectly connected position will not be found, indicating the misconnected
subfragments.

Beside the described three steps in ordering of subfragments some additional
control steps or steps applicable to specific sequences will be necessary for the generation
of more complete sequence without mistakes.

The determination of which subfragment belongs to which segment is performed by
comparison of contents of k-tuples in segments and subfragments. Because of the errors
in the k-tuple contents (due to the primary error in pools and statistical errors due to the
frequency of occurrences of k-tuples) the exact partitioning of subfragments is impossible.
Thus, instead of "all or none" partition, the chance of coming from the given segment
(P(sf,s)) is determined for each subfragment. This probability is a function of the
lengths of k-tuples, the lengths of subfragments, the lengths of fragments of ordering
library, the size of the pool, and of the percentage of false k-tuples in the file:

P(sf,s) = (Ck - F)/Lsf,

where Lsf is the length of subfragment, Ck is the number of common k-tuples for a given
subfragment/segment pair, and F is the parameter that includes relations between lengths
of k-tuples, fragments of basic library, the size of the pool, and the error percentage.
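
The formula can be evaluated directly once F is known. In the sketch below F is simply passed in as a number, because the text does not give an expression for it; Ck is counted as the number of distinct k-tuples of the subfragment that also occur in the segment, which is an assumption about how "common k-tuples" are counted.

```python
def belonging_score(subfragment, segment_ktuples, k, f):
    """Compute P(sf,s) = (Ck - F) / Lsf for one subfragment/segment pair."""
    subfragment_ktuples = {subfragment[i:i + k]
                           for i in range(len(subfragment) - k + 1)}
    ck = len(subfragment_ktuples & set(segment_ktuples))
    return (ck - f) / len(subfragment)

segment_ktuples = {"GACG", "ACGT", "CGTT", "TTTT"}
score = belonging_score("GACGTTAG", segment_ktuples, k=4, f=1)
print(score)  # (3 - 1) / 8 = 0.25
```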

Subfragments attributed to a particular segment are treated as redundant short pSFs
and are submitted to a process of unambiguous connection. The definition of
unambiguous connection is slightly different in this case, since it is based on a probability
that subfragments with overlapping end(s) belong to the segment considered. Besides, the
accuracy of unambiguous connection is controlled by following the connection of these
subfragments in other segments. After the connection in different segments, all of the
obtained subfragments are merged together, shorter subfragments included within longer
ones are eliminated, and the remaining ones are submitted to the ordinary connecting
process. If the sequence is not regenerated completely, the process of partition and
connection of subfragments is repeated with the same or less severe criteria of
probability of belonging to the particular segment, followed by unambiguous connection.

Using severe criteria for defining unambiguous overlap, some information is not
used. Instead of a complete sequence, several subfragments that define a number of
possibilities for a given fragment are obtained. Using less severe criteria an accurate and
complete sequence is generated. In a certain number of situations, e.g. an erroneous
connection, it is possible to generate a complete, but an incorrect sequence, or to generate
"monster" subfragments with no connection among them. Thus, for each fragment of the
basic library one obtains: a) several possible solutions where one is correct and b) the
most probable correct solution. Also, in a very small number of cases, due to the
mistake in the subfragment generation process or due to the specific ratio of the
probabilities of belonging, no unambiguous solution, nor a single most probable
solution, is generated. These cases remain as incomplete sequences, or the unambiguous
solution is obtained by comparing these data with other, overlapped fragments of basic
library.

The described algorithm was tested on a randomly generated, 50 kb sequence,
containing 40% GC to simulate the GC content of the human genome. In the middle part
of this sequence were inserted various Alu, and some other repetitive sequences, of a total
length of about 4 kb. To simulate an in vitro SBH experiment, the following operations
were performed to prepare appropriate data.

- Positions of sixty 5 kb overlapping "clones" were randomly defined, to simulate
preparation of a basic library.

- Positions of one thousand 500 bp "clones" were randomly determined to simulate
making the ordering library. These fragments were extracted from the sequence.
Random pools of 20 fragments were made, and k-tuple sets of pools were determined and
stored on the hard disk. These data are used in the subfragment ordering phase. For the
same density of clones 4 million clones in basic library and 3 million clones in ordering
library are used for the entire human genome. The total number of 7 million clones is
several fold smaller than the number of clones a few kb long for random cloning of
almost all of genomic DNA and sequencing by a gel-based method.

From the data on the starts and ends of 5 kb fragments, 117 "informative
fragments" were determined to be in the sequence. This was followed by determination
of the sets of overlapping k-tuples of which each "informative fragment" consists. Only
the subset of k-tuples matching a predetermined list was used. The list contained 65%
8-mers, 30% 9-mers, and 5% 10-12-mers. The processes of generation and ordering of
subfragments were performed on these data.

The testing of the algorithm was performed on the simulated data in two 
experiments. The sequence of 50 informative fragments was regenerated with the 100% 
correct data set (over 20,000 bp), and 26 informative fragments (about 10,000 bp) with 
10% false k-tuples (5% positive and 5% negative ones). 

In the first experiment, all subfragments were correct, and in only one out of 50
informative fragments was the sequence not completely regenerated; it remained in the
form of 5 subfragments. Analysis of the positions of overlapping fragments of the ordering
library has shown that they lack the information for the unique ordering of the 5
subfragments. The subfragments may be connected in two ways based on overlapping
ends, 1-2-3-4-5 and 1-4-3-2-5. The only difference is the exchange of positions of
subfragments 2 and 4. Since subfragments 2, 3, and 4 are relatively short (a total of about
100 bp), there was a relatively greater chance, realized in this case, that none of the
fragments of the ordering library started or ended in the subfragment 3 region.

To simulate real sequencing, some false ("hybridization") data was included as
input in a number of experiments. In oligomer hybridization experiments, under
the proposed conditions, the only situation producing unreliable data is the end mismatch
versus full match hybridization. Therefore, in the simulation only those k-tuples differing in
a single element on either end from the real one were considered to be false positives.
These "false" sets are made as follows. To the original set of k-tuples of the
informative fragment, a subset of 5% false positive k-tuples is added. False positive
k-tuples are made by randomly picking a k-tuple from the set, copying it and altering a
nucleotide at its beginning or end. This is followed by subtraction of a subset of 5%
randomly chosen k-tuples. In this way the statistically expected number of the most
complicated cases is generated, in which the correct k-tuple is replaced with a k-tuple with
the wrong base on the end.
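The construction of the "false" k-tuple sets described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program used in the test: the function names `ktuples` and `add_noise` are hypothetical, and it assumes that the 5% of k-tuples to be erased are drawn from the original (true) set rather than from the set already containing false positives, a point the text leaves open.

```python
import random

BASES = "ACGT"

def ktuples(seq, k):
    """Return the set of all overlapping k-tuples contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def add_noise(true_set, rate=0.05, rng=None):
    """Add end-mismatch false positives, then erase true k-tuples.

    Assumption: both subsets have size rate * len(true_set), and the
    erased k-tuples are chosen from the original set.
    """
    rng = rng or random.Random(0)
    true_list = sorted(true_set)
    n = max(1, round(rate * len(true_list)))

    # False positives: copy a k-tuple and alter its first or last base.
    false_pos = set()
    while len(false_pos) < n:
        t = rng.choice(true_list)
        if rng.random() < 0.5:
            cand = rng.choice(BASES.replace(t[0], "")) + t[1:]
        else:
            cand = t[:-1] + rng.choice(BASES.replace(t[-1], ""))
        if cand not in true_set:
            false_pos.add(cand)

    # False negatives: erase a random subset of the original k-tuples.
    false_neg = set(rng.sample(true_list, n))

    noisy = (set(true_list) | false_pos) - false_neg
    return noisy, false_pos, false_neg
```

Because the picks are random, the share of false data in the resulting set varies from run to run, which matches the remark that the value "varies from case to case".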

Production of k-tuple sets as described leads to up to 10% of false data. This
value varies from case to case, due to the randomness of the choice of k-tuples to be copied,
altered, and erased. Nevertheless, this percentage exceeds by 3-4 times the amount of
unreliable data in real hybridization experiments. The introduced error of 10% leads to
a two-fold increase in the number of subfragments both in fragments of the basic library
(basic library informative fragments) and in segments. About 10% of the final
subfragments have a wrong base at the end, as expected for a k-tuple set which contains
false positives (see generation of primary subfragments). Neither cases of
misconnection of subfragments nor subfragments with the wrong sequence were observed.
In 4 informative fragments out of the 26 examined in the ordering process, the complete
sequence was not regenerated. In all 4 cases the sequence was obtained in the form of
several longer subfragments and several shorter subfragments contained in the same
segment. This result shows that the algorithmic principles allow working with a large
percentage of false data.

The success of the generation of the sequence from its k-tuple content may be
described in terms of completeness and accuracy. In the process of generation, two
particular situations can be defined: 1) some part of the information is missing in the
generated sequence, but one knows where the ambiguities are and to which type they
belong, and 2) the regenerated sequence that is obtained does not match the sequence
from which the k-tuple content is generated, but the mistake cannot be detected.
Assuming the algorithm is developed to its theoretical limits, as in the use of the exact
k-tuple sets, only the first situation can take place. There the incompleteness results in a
certain number of subfragments that may not be ordered unambiguously and in the problem
of determining the exact length of monotonous sequences, i.e. the number of perfect
tandem repeats.

With false k-tuples, an incorrect sequence may be generated. The reason for
mistakes does not lie in the shortcomings of the algorithm, but in the fact that a given
content of k-tuples unambiguously represents a sequence that differs from the original
one. One may define three classes of error, depending on the kind of false k-tuples
present in the file. False negative k-tuples (which are not accompanied by false
positives) produce "deletions". False positive k-tuples produce "elongations
(unequal crossing over)". False positives accompanied by false negatives are the reason
for the generation of "insertions", alone or combined with "deletions". Deletions are
produced when all of the k-tuples (or their majority) between two possible starts of the
subfragments are false negatives. Since every position in the sequence is defined by k
k-tuples, the occurrence of a deletion in the common case requires k consecutive false
negatives. (With 10% false negatives and k = 8, this situation takes place once in
every 10^8 elements.) This situation is extremely infrequent even in mammalian genome
sequencing using random libraries containing ten genome equivalents.
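The figure quoted for deletions follows from a simple independence argument, sketched below. The function name `positions_per_deletion` is hypothetical, and the sketch assumes that false negatives occur independently at each position with the same rate.

```python
def positions_per_deletion(false_negative_rate, k):
    """Expected number of positions per run of k consecutive false negatives.

    Assumes each k-tuple is missed independently with the given rate, so a
    run of k misses starting at a given position has probability rate ** k.
    """
    return 1.0 / (false_negative_rate ** k)

# With 10% false negatives and k = 8: about one run per 10^8 positions.
interval = positions_per_deletion(0.10, 8)
```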

Elongation of the end of the sequence caused by false positive k-tuples is the
special case of "insertions", since the end of the sequence can be considered as the
endless linear array of false negative k-tuples. One may consider a group of false
positive k-tuples producing subfragments longer than one k-tuple. Situations of this kind
may be detected if subfragments are generated in overlapped fragments, like random
physical fragments of the ordering library. An insertion, or insertion in place of a
deletion, can arise as a result of specific combinations of false positive and false negative
k-tuples. In the first case, the number of consecutive false negatives is smaller than k.
Both cases require several overlapping false positive k-tuples. The insertions and
deletions are mostly theoretical possibilities without sizable practical repercussions since
the requirements in the number and specificity of false k-tuples are simply too high.

In every other situation, where the theoretical requirement for the minimal
number and kind of false positives and/or negatives is not met, mistakes in the k-tuple content
may produce only a lesser completeness of the generated sequence.

In Format 3 SBH, a sample nucleic acid is sequenced by exposing the sample to a
support-bound probe of known sequence and a labeled probe or probes in solution.
A ligase is introduced into the mixture of probes and sample, such
that, wherever a support has a bound probe and a labeled probe hybridized back to back
along the sample, the two probes will be chemically linked by the action of the ligase.
After washing, only chemically linked support-bound and labeled probes are detected by
the presence of the labeled probe. By knowing the identity of the support-bound probe at
a particular location in an array, and the identity of the labeled probe, a portion of the
sequence of the sample may be determined by the presence of a label at a point in an
array on a Format 3 substrate. By maximally overlapping the sequences of all of the
ligated probe pairs, the sequence of the sample may be reconstructed. The sample to be
sequenced may be a nucleic acid fragment or oligonucleotide of ten base pairs ("bp").
The sample is preferably four to one thousand bases in length.

The probe is a fragment less than ten bases in length and,
preferably, is between four and nine bases in length. In this way, arrays of
support-bound probes may include all oligonucleotides of a given length or may include
only oligonucleotides selected for a particular test. Where all oligonucleotides of a given
length are used, the number of central oligonucleotides may be calculated by 4^N, where N
is the length of the probe.
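As a small illustration of the 4^N count, the complete set of probes of length N can be enumerated directly. The helper name `all_probes` is hypothetical and the sketch only illustrates the combinatorics, not any array layout.

```python
from itertools import product

def all_probes(n):
    """Enumerate every oligonucleotide of length n over the bases A, C, G, T."""
    return ["".join(p) for p in product("ACGT", repeat=n)]

# The number of probes grows as 4 ** n, e.g. 256 for 4-mers and 65536 for 8-mers.
counts = {n: 4 ** n for n in range(4, 10)}
```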
EXAMPLE 18

Re-Using Sequencing Chips

When ligation is employed in the sequencing process, the ordinary
oligonucleotide chip cannot be immediately reused. The inventor contemplates that this
may be overcome in various ways.

One may employ ribonucleotides for the second probe, probe P, so that this probe
may subsequently be removed by RNAse treatment. RNAse treatment may utilize RNAse
A, an endoribonuclease that specifically attacks single-stranded RNA 3' to pyrimidine
residues and cleaves the phosphate linkage to the adjacent nucleotide. The end products
are pyrimidine 3' phosphates and oligonucleotides with terminal pyrimidine 3' phosphates.
RNAse A works in the absence of cofactors and divalent cations.

To utilize an RNAse, one would generally incubate the chip in any appropriate
RNAse-containing buffer, as described by Sambrook et al. (1989; incorporated herein by
reference). The use of 30-50 µl of RNAse-containing buffer per 8 x 8 mm or 9 x 9 mm
array at 37°C for between 10 and 60 minutes is appropriate. One would then wash with
hybridization buffer.

Although not widely applicable, one could also use the uracil base, as described by
Craig et al. (1989), incorporated herein by reference, in specific embodiments.
Destruction of the ligated probe combination, to yield a re-usable chip, would be achieved
by digestion with the E. coli repair enzyme uracil-DNA glycosylase, which removes
uracil from DNA.

One could also generate a specifically cleavable bond between the probes and then
cleave the bond after detection. For example, this may be achieved by chemical ligation
as described by Shabarova et al. (1991) and Dolinnaya et al. (1988), both references
being specifically incorporated herein by reference.

Shabarova et al. (1991) describe the condensation of oligodeoxyribonucleotides
with cyanogen bromide as a condensing agent. In their one-step chemical ligation
reaction, the oligonucleotides are heated to 97°C, slowly cooled to 0°C, then 1 µl of 10 mM
BrCN in acetonitrile is added.

Dolinnaya et al. (1988) show how to incorporate phosphoramidate and
pyrophosphate internucleotide bonds in DNA duplexes. They also use a chemical ligation
method for modification of the sugar phosphate backbone of DNA, with a water-soluble
carbodiimide (CDI) as a coupling agent. The selective cleavage of a phosphoamide bond
involves contact with 15% CH3COOH for 5 min at 95°C. The selective cleavage of a
pyrophosphate bond involves contact with a pyridine-water mixture (9:1) and freshly
distilled (CF3CO)2O.

EXAMPLE 19 

Diagnostics - Scoring Known Mutations or Full Gene Resequencing 

In a simple case, the goal may be to discover whether selected, known mutations
occur in a DNA segment. Twelve or fewer probes may suffice for this purpose, for example,
5 probes positive for one allele, 5 positive for the other, and 2 negative for both. Because
of the small number of probes to be scored per sample, large numbers of samples may be
analyzed in parallel. For example, with 12 probes in 3 hybridization cycles, 96 different
genomic loci or gene segments from 64 patients may be analyzed on one 6 x 9 in
membrane containing 12 x 24 subarrays, each with 64 dots representing the same DNA
segment from 64 patients. In this example, samples may be prepared in sixty-four 96-well
plates. Each plate may represent one patient, and each well may represent one of the
DNA segments to be analyzed. The samples from 64 plates may be spotted in four
replicas as four quarters of the same membrane.

A set of 12 probes may be selected by single channel pipetting or by a single pin
transferring device (or by an array of individually-controlled pipets or pins) for each of
the 96 segments, and the selected probes may be arrayed in twelve 96-well plates. Probes
may be labelled, if they are not prelabelled, and then probes from four plates may be
mixed with hybridization buffer and added to the subarrays, preferentially by a 96-channel
pipetting device. After one hybridization cycle it is possible to strip off previously-applied
probes by incubating the membrane at 37° to 55°C in the preferably undiluted
hybridization or washing buffer.

The likelihood that probes positive for one allele are positive and probes positive
for the other allele are negative may be used to determine which of the two alleles is
present. In this redundant scoring scheme, some level (about 10%) of errors in
hybridization of each probe may be tolerated.
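One way to picture the redundant scoring scheme is as a comparison of two likelihoods, one for each allele, under an assumed per-probe error rate. The sketch below is illustrative only: the function name `call_allele`, the independence of probe errors, and the use of a single 10% error rate for every probe are assumptions, and heterozygous samples are not modelled.

```python
import math

def call_allele(a_calls, b_calls, error=0.10):
    """Decide between alleles A and B from redundant probe scores.

    a_calls / b_calls: lists of booleans, True where a probe designed to be
    positive for allele A (or B) was scored positive. Each probe is assumed
    to give the wrong answer independently with probability `error`.
    Returns (call, log-likelihood difference).
    """
    def loglik(expected_pos, expected_neg):
        ok = sum(expected_pos) + sum(not c for c in expected_neg)
        bad = len(expected_pos) + len(expected_neg) - ok
        return ok * math.log(1 - error) + bad * math.log(error)

    ll_a = loglik(a_calls, b_calls)   # hypothesis: allele A present
    ll_b = loglik(b_calls, a_calls)   # hypothesis: allele B present
    if ll_a == ll_b:
        return "undetermined", 0.0
    return ("A" if ll_a > ll_b else "B"), abs(ll_a - ll_b)

# One miscalled probe in each group of five still yields a clear call for A.
call, margin = call_allele([True, True, True, True, False],
                           [False, False, False, True, False])
```

With five probes per allele, a single erroneous score in each group leaves eight of ten scores consistent with the correct allele, which is why an error level of about 10% per probe can be tolerated.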

An incomplete set of probes may be used for scoring most of the alleles,
especially if the smaller redundancy is sufficient, e.g. one or two probes which prove the
presence or absence in a sample of one of the two alleles. For example, with a set of four
thousand 8-mers there is a 91% chance of finding at least one positive probe for one of
the two alleles for a randomly selected locus. The incomplete set of probes may be
optimized to reflect G+C content and other biases in the analyzed samples.

For full gene sequencing, genes may be amplified in an appropriate number of
segments. For each segment, a set of probes (about one probe per 2-4 bases) may be
selected and hybridized. These probes may identify whether there is a mutation anywhere
in the analyzed segments. Segments (i.e., subarrays which contain these segments) where
one or more mutated sites are detected may be hybridized with additional probes to find
the exact sequence at the mutated sites. If a DNA sample is tested by every second
6-mer, and a mutation is localized at the position that is surrounded by positively
hybridized probes TGCAAA and TATTCC and covered by three negative probes,
CAAAAC, AAACTA and ACTATT, the mutated nucleotides must be A and/or C
occurring in the normal sequence at that position. They may be changed by a single base
mutation, or by a one or two nucleotide deletion and/or insertion between bases AA, AC
or CT.

One approach is to select probes that extend the positively hybridized probe
TGCAAA by one nucleotide to the right, and that extend the probe TATTCC by one
nucleotide to the left. With these 8 probes (GCAAAA, GCAAAT, GCAAAC, GCAAAG
and ATATTC, TTATTC, CTATTC, GTATTC) the two questionable nucleotides are
determined.
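The localization logic of this example can be traced with a short sketch. It assumes the normal sequence TGCAAAACTATTCC, which is the sequence implied by the five overlapping 6-mers named in the text; the function names are hypothetical, and probes are matched to positions by sequence, which presumes no 6-mer repeats within the segment.

```python
def tile(seq, k=6, step=2):
    """Every `step`-th k-mer of seq, with its start position."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]

def suspect_positions(seq, negatives, k=6, step=2):
    """Positions covered by every negative probe and by no positive probe."""
    suspect = set(range(len(seq)))
    for start, probe in tile(seq, k, step):
        covered = set(range(start, start + k))
        if probe in negatives:
            suspect &= covered      # the mutation must lie under each negative probe
        else:
            suspect -= covered      # a positive probe confirms the bases it covers
    return sorted(suspect)

def extension_probes(left_positive, right_positive):
    """Shift the flanking positive probes by one base into the suspect region."""
    right_ext = [left_positive[1:] + b for b in "ATCG"]
    left_ext = [b + right_positive[:-1] for b in "ATCG"]
    return right_ext, left_ext

normal = "TGCAAAACTATTCC"                      # assumed from the probes in the text
negatives = {"CAAAAC", "AAACTA", "ACTATT"}
positions = suspect_positions(normal, negatives)
suspect_bases = [normal[i] for i in positions]
right_ext, left_ext = extension_probes("TGCAAA", "TATTCC")
```

In this sketch the suspect bases come out as the A and C between the two positive probes, and the two extension lists reproduce the eight probes listed in the text.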

The most likely hypothesis about the mutation may be determined. For example, A
is found to be mutated to G. There are two solutions satisfied by these results. Either
replacement of A with G is the only change, or there is, in addition to that change, an
insertion of some number of bases between the newly determined G and the following C. If
the result with bridging probes is negative, these options may then be checked first by at
least one bridging probe comprising the mutated position (AAGCTA) and with an
additional 8 probes: CAAAGA, CAAAGT, CAAAGC, CAAAGG and ACTATT,
TCTATT, CCTATT, GCTATT. There are many other ways to select mutation-solving
probes.

In the case of diploid samples, particular comparisons of scores for the test samples and
a homozygotic control may be performed to identify heterozygotes (see above). A few
consecutive probes are expected to have signals roughly half as strong if the segment
covered by these probes is mutated on one of the two chromosomes.
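The heterozygote signature described here, a run of consecutive probes at about half the control signal, can be searched for as sketched below. The function name `half_signal_runs`, the ratio window of 0.35 to 0.65, and the minimum run length of three probes are illustrative assumptions rather than values given in the text; control intensities are assumed to be non-zero.

```python
def half_signal_runs(test, control, min_run=3, low=0.35, high=0.65):
    """Find runs of consecutive probes whose test/control ratio is near one half.

    test, control: intensities for the same ordered probes in the test sample
    and in a homozygotic control. Returns (first, last) probe indices of runs.
    """
    ratios = [t / c for t, c in zip(test, control)]
    runs, start = [], None
    for i, r in enumerate(ratios + [None]):      # sentinel closes a trailing run
        hit = r is not None and low <= r <= high
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs

control = [100] * 8
test = [98, 102, 51, 49, 52, 100, 97, 50]
runs = half_signal_runs(test, control)
```

Requiring several consecutive probes guards against calling a heterozygote from a single weak hybridization.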

EXAMPLE 20 

Identification of Genes (Mutations) Responsible
for Genetic Disorders and Other Traits

Using universal sets of longer probes (8-mers or 9-mers) on immobilized arrays of
samples, DNA fragments as long as 5-20 kb may be sequenced without subcloning.
Furthermore, the speed of sequencing readily may be about 10 million
bp/day/hybridization instrument. This performance allows for resequencing a large
fraction of human genes or the human genome repeatedly from scientifically or medically
interesting individuals. To resequence 50% of the human genes, about 100 million bp is
checked. That may be done in a relatively short period of time at an affordable cost.

This enormous resequencing capability may be used in several ways to identify
mutations and/or genes that encode for disorders or any other traits. Basically, mRNAs
(which may be converted into cDNAs) from particular tissues or genomic DNA of
patients with particular disorders may be used as starting materials. From both sources of
DNA, separate genes or genomic fragments of appropriate length may be prepared either
by cloning procedures or by in vitro amplification procedures (for example by PCR). If
cloning is used, the minimal set of clones to be analyzed may be selected from the
libraries before sequencing. That may be done efficiently by hybridization of a small
number of probes, especially if a small number of clones longer than 5 kb is to be sorted.
Cloning may increase the amount of hybridization data about two times, but does not
require tens of thousands of PCR primers.

In one variant of the procedure, gene or genomic fragments may be prepared by
restriction cutting with enzymes like Hga I, which cuts DNA in the following way:
GACGC(N5')/CTGCG(N10'). Protruding ends of five bases are different for different
fragments. One enzyme produces appropriate fragments for a certain number of genes. By
cutting cDNA or genomic DNA with several enzymes in separate reactions, every gene of
interest may be excised appropriately. In one approach, the cut DNA is fractionated by
size. DNA fragments prepared in this way (and optionally treated with Exonuclease III,
which individually removes nucleotides from the 3' end and increases the length and
specificity of the ends) may be dispensed in tubes or in multiwell plates. From a
relatively small set of DNA adapters with a common portion and a variable protruding
end of appropriate length, a pair of adapters may be selected for every gene fragment that
needs to be amplified. These adapters are ligated and then PCR is performed with universal
primers. From 1000 adapters, a million pairs may be generated; thus a million different
fragments may be specifically amplified under identical conditions with a universal pair of
primers complementary to the common end of the adapters.
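The adapter arithmetic can be checked with a few lines. The sketch rests on two assumptions that go slightly beyond the text: that one adapter is chosen independently for each end of a fragment (so pairs are ordered), and that the figure of about 1000 adapters corresponds to the 4^5 = 1024 possible five-base protruding ends. The function names are hypothetical.

```python
def overhang_variants(length):
    """Number of distinct protruding ends of the given length."""
    return 4 ** length

def adapter_pairs(n_adapters):
    """Number of (left end, right end) adapter combinations."""
    return n_adapters * n_adapters

five_base_ends = overhang_variants(5)      # 1024, close to the 1000 adapters quoted
pairs_from_1000 = adapter_pairs(1000)      # 1,000,000 distinct pairs
```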

If a DNA difference is found to be repeated in several patients, and that sequence
change is nonsense or can change the function of the corresponding protein, then the mutated
gene may be responsible for the disorder. By analyzing a significant number of
individuals with particular traits, functional allelic variations of particular genes could be
associated with specific traits.

This approach may be used to eliminate the need for very expensive genetic
mapping on extensive pedigrees and has special value when there is no such genetic data
or material.

EXAMPLE 21 

Scoring Single Nucleotide Polymorphisms in Genetic Mapping

Techniques disclosed in this application are appropriate for an efficient
identification of genomic fragments with single nucleotide polymorphisms (SNUPs). By
applying the described sequencing process in 10 individuals to a large number of
genomic fragments of known sequence that may be amplified by cloning or by in vitro
amplification, a sufficient number of DNA segments with SNUPs may be identified. The
polymorphic fragments are further used as SNUP markers. These markers are either
mapped previously (for example, they represent mapped STSs) or they may be mapped
through the screening procedure described below.

SNUPs may be scored in every individual from relevant families or populations by
amplifying markers and arraying them in the form of an array of subarrays. Subarrays
contain the same marker amplified from the analyzed individuals. For each marker, as in
the diagnostics of known mutations, a set of 6 or fewer probes positive for one allele and 6
or fewer probes positive for the other allele may be selected and scored. From the
significant association of one or a group of the markers with the disorder, the chromosomal
position of the responsible gene(s) may be determined. Because of the high throughput
and low cost, thousands of markers may be scored for thousands of individuals. This
amount of data allows localization of a gene at a resolution level of less than one million
bp as well as localization of genes involved in polygenic diseases. Localized genes may
be identified by sequencing particular regions from relevant normal and affected
individuals to score a mutation(s).

PCR is preferred for amplification of markers from genomic DNA. Each of the
markers requires a specific pair of primers. The existing markers may be convertible, or
new markers may be defined which may be prepared by cutting genomic DNA with Hga I
type restriction enzymes, and by ligation with a pair of adapters.

SNUP markers can be amplified or spotted as pools to reduce the number of 
independent amplification reactions. In this case, more probes are scored per one sample. 
When 4 markers are pooled and spotted on 12 replica membranes, then 48 probes (12 per 
marker) may be scored in 4 cycles. 

EXAMPLE 22 

Detection and Verification of Identity of DNA Fragments 

DNA fragments generated by restriction cutting, cloning or in vitro amplification
(e.g. PCR) frequently may be identified in an experiment. Identification may be
performed by verifying the presence of a DNA band of specific size on gel
electrophoresis. Alternatively, a specific oligonucleotide may be prepared and used to
verify a DNA sample in question by hybridization. The procedure developed here allows
for more efficient identification of a large number of samples without preparing a specific
oligonucleotide for each fragment. A set of positive and negative probes may be selected
from the universal set for each fragment on the basis of the known sequences. Probes that
are selected to be positive usually are able to form one or a few overlapping groups, and
negative probes are spread over the whole insert.

This technology may be used for identification of STSs in the process of their
mapping on the YAC clones. Each of the STSs may be tested on about 100 YAC clones
or pools of YAC clones. DNAs from these 100 reactions may be spotted in one
subarray. Different STSs may represent consecutive subarrays. In several hybridization
cycles, a signature may be generated for each of the DNA samples, which signature
proves or disproves existence of the particular STS in the given YAC clone with the
necessary confidence.

To reduce the number of independent PCR reactions or the number of independent
samples for spotting, several STSs may be amplified simultaneously in a reaction or PCR
samples may be mixed, respectively. In this case more probes have to be scored per one
dot. The pooling of STSs is independent of pooling YACs and may be used on single
YACs or pools of YACs. This scheme is especially attractive when several probes
labelled with different colors are hybridized together.

In addition to confirmation of the existence of a DNA fragment in a sample, the
amount of DNA may be estimated using intensities of the hybridization of several separate
probes or one or more pools of probes. By comparing obtained intensities with intensities
for control samples having a known amount of DNA, the quantity of DNA in all spotted
samples is determined simultaneously. Because only a few probes are necessary for
identification of a DNA fragment, and there are N possible probes that may be used for
DNA N bases long, this application does not require a large set of probes to be sufficient
for identification of any DNA segment. From one thousand 8-mers, on average about 30
full matching probes may be selected for a 1000 bp fragment.
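The figure of about 30 matching probes can be reproduced with a back-of-the-envelope estimate. The sketch assumes a random fragment, probes drawn at random from all 4^8 possible 8-mers, and that matches to either strand of the double-stranded fragment are counted; the last assumption is not stated in the text but is what brings the estimate to about 30. The function name `expected_matches` is hypothetical.

```python
def expected_matches(n_probes, k, fragment_len, strands=2):
    """Expected number of k-mer positions in a fragment matched by a probe set.

    Each of the (fragment_len - k + 1) positions per strand is matched with
    probability n_probes / 4 ** k, assuming random sequence and random probes.
    """
    positions = fragment_len - k + 1
    return strands * positions * n_probes / 4 ** k

estimate = expected_matches(1000, 8, 1000)   # about 30 for both strands
```

Counting a single strand only would give roughly half that number, so the strand assumption matters when sizing a probe set.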

EXAMPLE 23 

Identification of Infectious Disease Organisms and Their Variants

DNA-based tests for the detection of viral, bacterial, fungal and other parasitic
organisms in patients are usually more reliable and less expensive than alternatives. The
major advantage of DNA tests is to be able to identify specific strains and mutants, and
eventually be able to apply more effective treatment. Two applications are described
below.

The presence of 12 known antibiotic resistance genes in bacterial infections may
be tested by amplifying these genes. The amplified products from 128 patients may be
spotted in two subarrays, and 24 subarrays for 12 genes may then be repeated four times
on an 8 x 12 cm membrane. For each gene, 12 probes may be selected for positive and
negative scoring. Hybridizations may be performed in 3 cycles. For these tests, a much
smaller set of probes is most likely to be universal. For example, from a set of one
thousand 8-mers, on average 30 probes are positive in 1000 bp fragments, and 10 positive
probes are usually sufficient for a highly reliable identification. As described in Example
9, several genes may be amplified and/or spotted together and the amount of the given
DNA may be determined. The amount of amplified gene may be used as an indicator of
the level of infection.

Another example involves possible sequencing of one gene or the whole genome
of an HIV virus. Because of rapid diversification, the virus poses many difficulties for
selection of an optimal therapy. DNA fragments may be amplified from isolated viruses
from up to 64 patients and resequenced by the described procedure. On the basis of the
obtained sequence the optimal therapy may be selected. If there is a mixture of two virus
types of which one has the basic sequence (similar to the case of heterozygotes), the
mutant may be identified by quantitative comparisons of its hybridization scores with
scores of other samples, especially control samples containing the basic virus type only.
Scores about half as large may be obtained for three to four probes that cover the site mutated
in one of the two virus types present in the sample (see above).

EXAMPLE 24 

Forensic and Parental Identification 

Sequence polymorphisms make an individual genomic DNA unique. This permits
analysis of blood or other body fluids or tissues from a crime scene and comparison with
samples from criminal suspects. A sufficient number of polymorphic sites are scored to
produce a unique signature of a sample. SBH may easily score single nucleotide
polymorphisms to produce such signatures.

A set of DNA fragments (10-1000) may be amplified from samples and suspects.
DNAs from samples and suspects representing one fragment are spotted in one or several
subarrays and each subarray may be replicated 4 times. In three cycles, 12 probes may
determine the presence of allele A or B in each of the samples, including suspects, for
each DNA locus. Matching the patterns of samples and suspects may lead to discovery of
the suspect responsible for the crime.

The same procedure may be applicable to prove or disprove the identity of parents
of a child. DNA may be prepared and polymorphic loci amplified from the child and
adults; patterns of A or B alleles may be determined by hybridization for each.
Comparisons of the obtained patterns, along with positive and negative controls, aid in
the determination of familial relationships. In this case, only a significant portion of the
alleles need match with one parent for identification. Large numbers of scored loci allow
for the avoidance of statistical errors in the procedure or of masking effects of de novo
mutations.

EXAMPLE 25 

Assessing Genetic Diversity of Populations or Species 
and Biological Diversity of Ecological Niches 

Measuring the frequency of allelic variations on a significant number of loci (for
example, several genes or entire mitochondrial DNA) permits development of different
types of conclusions, such as conclusions regarding the impact of the environment on the
genotypes, history and evolution of a population or its susceptibility to diseases or
extinction, and others. These assessments may be performed by testing specific known
alleles or by full resequencing of some loci to be able to define de novo mutations which
may reveal fine variations or presence of mutagens in the environment.

Additionally, biodiversity in the microbial world may be surveyed by resequencing
evolutionarily conserved DNA sequences, such as the genes for ribosomal RNAs or genes
for highly conservative proteins. DNA may be prepared from the environment and
particular genes amplified using primers corresponding to conservative sequences. DNA
fragments may be cloned preferentially in a plasmid vector (or diluted to the level of one
molecule per well in multiwell plates and then amplified in vitro). Clones prepared this
way may be resequenced as described above. Two types of information are obtained.
First of all, a catalogue of different species may be defined as well as the density of the
individuals for each species. Another segment of information may be used to measure the
influence of ecological factors or pollution on the ecosystem. It may reveal whether some
species are eradicated or whether the abundance ratios among species are altered due to the
pollution. The method also is applicable for sequencing DNAs from fossils.

EXAMPLE 26 
Detection or Quantification of Nucleic Acid Species

DNA or RNA species may be detected and quantified by employing a probe pair
including an unlabeled probe fixed to a substrate and a labeled probe in a solution. The
species may be detected and quantified by exposure to the unlabeled probe in the presence
of the labeled probe and ligase. Specifically, the formation of an extended probe by
ligation of the labeled and unlabeled probe on the sample nucleic acid backbone is
indicative of the presence of the species to be detected. Thus, the presence of label at a
specific point in the array on the substrate after removing unligated labeled probe
indicates the presence of a sample species while the quantity of label indicates the
expression level of the species.

Alternatively, one or more unlabeled probes may be arrayed on a substrate as first
members of pairs with one or more labeled probes to be introduced in solution.
According to one method, multiplexing of the label on the array may be carried out by
using dyes which fluoresce at distinguishable wavelengths. In this manner, a mixture of
cDNAs applied to an array with pairs of labeled and unlabeled probes specific for species
to be identified may be examined for the presence of and expression level of cDNA
species. According to a preferred embodiment this approach may be carried out to
sequence portions of cDNAs by selecting pairs of unlabeled and labeled probes
comprising sequences which overlap along the sequence of a cDNA to be detected.

Probes may be selected to detect the presence and quantity of a particular pathogenic
organism's genome by including in the composition selected probe pairs which appear in
combination only in the genome of the target pathogenic organism. Thus, while no single
probe pair may necessarily be specific for the pathogenic organism's genome, the
combination of pairs is. Similarly, in detecting or sequencing cDNAs, it might occur that a
particular probe is not specific for a cDNA or other type of species. Nevertheless, the
presence and quantity of a particular species may be determined by a result wherein a
combination of selected probes situated at distinct array locations is indicative of the
presence of a particular species.
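
The combination logic described above can be sketched in a few lines of Python. This is an illustration only: the probe-pair identifiers, the two example signatures, and the function name `species_present` are hypothetical and are not taken from the specification. A species is called present only when every probe pair in its signature gives a positive signal, even though individual pairs are shared with other species.

```python
def species_present(positive_pairs, signatures):
    """Return the species whose complete set of signature probe pairs is positive.

    positive_pairs: set of probe-pair IDs that gave a ligation signal on the array.
    signatures: dict mapping a species name to the set of probe-pair IDs that
                occur in combination only in that species (assumed known).
    """
    return sorted(name for name, pairs in signatures.items()
                  if pairs <= positive_pairs)


# Hypothetical signatures: pair1 and pair2 are shared, so neither is specific alone.
signatures = {
    "pathogen_A": {"pair1", "pair2", "pair3"},
    "relative_B": {"pair1", "pair2", "pair4"},
}

positive = {"pair1", "pair2", "pair3"}
detected = species_present(positive, signatures)
print(detected)
```

In this sketch a sample that lights up only the shared pairs is not assigned to either species, which mirrors the point that the combination, rather than any single pair, carries the specificity.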

An infectious agent with about 10 kb or more of DNA may be detected using a
support-bound detection chip without the use of polymerase chain reaction (PCR) or other
target amplification procedures. According to other methods, the genomes of infectious
agents including bacteria and viruses are assayed by amplification of a single target
nucleotide sequence through PCR and detection of the presence of target by hybridization
of a labeled probe specific for the target sequence. Because such an assay is specific for
only a single target sequence, it is necessary to amplify the gene by methods
such as PCR to provide sufficient target for a detectable signal.

According to this example, an improved method of detecting nucleotide sequences
characteristic of infectious agents through a Format 3-type reaction is provided wherein a
solid phase detection chip is prepared which comprises an array of multiple different
immobilized oligonucleotide probes specific for the infectious agent of interest. A single
dot comprising a mixture of many unlabeled probes complementary to the target nucleic
acid concentrates the label specific to a species at one location, thereby improving
sensitivity over diffuse or single probe labeling. Such multiple probes may be of
overlapping sequences of the target nucleotide sequence but may also be non-overlapping
or non-adjacent sequences. Such probes preferably have a length of about 5 to 12
nucleotides.

A nucleic acid sample is exposed to the probe array, and target sequences present in
the sample will hybridize with the multiple immobilized probes. A pool of multiple
labeled probes selected to specifically bind to the target sequences adjacent to the
immobilized probes is then applied with the sample to an array of unlabeled
oligonucleotide probe mixtures. Ligase enzyme is then applied to the chip to ligate the
adjacent probes on the sample. The detection chip is then washed to remove
unhybridized and unligated probe and sample nucleic acids, and the presence of sample
nucleic acid may be determined by the presence or absence of label. This method
provides reliable sample detection with about a 1000-fold reduction of molarity of the
sample agent.

As a further aspect of the invention, the signal of the labeled probes may be
amplified by means such as providing a common tail to the free probe which itself
comprises multiple chromogenic, enzymatic or radioactive labels or which is itself
susceptible to specific binding by a further probe agent which is multiply labeled. In this
way, a second round of signal amplification may be carried out. Labeled or unlabeled
probes may be used in a second round of amplification. In this second round of
amplification, a lengthy DNA sample with multiple labels may increase signal
intensity 10- to 100-fold, which may result in a total signal
amplification of 100,000-fold. Through the use of both aspects of this example, a
signal intensity increase of approximately 100,000-fold may give a positive result of
probe-DNA ligation without having to employ PCR or other amplification procedures.

According to a further aspect of the invention, an array or super array may be
prepared which consists of a complete set of probes, for example 4096 6-mer probes.
Arrays of this type are universal in the sense that they can be used for detection or partial
to complete sequencing of any nucleic acid species. Individual spots in an array may
contain single probe species or mixtures of probes, for example N(1-3) B(4-6) N(1-3)
type mixtures that are synthesized in a single reaction (N represents all four
nucleotides, B one specific nucleotide, and the associated numbers are a range of
numbers of bases, i.e., 1-3 means "from one to three bases"). These mixtures provide
stronger signal for a nucleic acid species present at low concentration by collecting signal
from different parts of the same long nucleic acid species molecule. The universal set of
probes may be subdivided into many subsets which are spotted as unit arrays separated by
barriers that prevent spreading of hybridization buffer with sample and labeled probe(s).
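
The size of a complete probe set and of an N/B/N-type mixture can be checked with a short Python sketch. The function names `all_kmers` and `mixture`, and the example core sequence `GATC`, are assumptions made for illustration; the specification does not prescribe any particular enumeration procedure.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """Enumerate every sequence of length k over A, C, G, T (4**k sequences)."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def mixture(core, n_left, n_right):
    """Expand an N(n_left) B(core) N(n_right) mixture into explicit sequences.

    Each N position is degenerate (all four bases); the core bases are fixed.
    """
    return [left + core + right
            for left in all_kmers(n_left)
            for right in all_kmers(n_right)]


universal = all_kmers(6)       # the complete 6-mer set
spot = mixture("GATC", 1, 2)   # hypothetical N(1) B(4) N(2) spot

print(len(universal), len(spot))
```

Under these assumptions the complete 6-mer set has 4^6 = 4096 members, and a spot with three degenerate positions around a fixed 4-base core contains 4^3 = 64 distinct 7-base sequences.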

For detection of a nucleic acid species with a known sequence, one or more
oligonucleotide sequences comprising both unlabeled fixed probes and labeled probes in
solution may be selected. Labeled probes are synthesized or selected from presynthesized
complete sets of, for example, 7-mers. The labeled probes are added to corresponding
unit arrays of fixed probes such that a pair of fixed and labeled probes will hybridize
adjacently to the target sequence, so that upon administration of ligase the probes will be
covalently bound.

If a unit array contains more than one fixed probe (as separate spots or within the
same spot) that is positive for a given nucleic acid species, all corresponding labeled
probes may be mixed and added to the same unit array. Mixtures of labeled probes
are even more important when mixtures of nucleic acid species are tested. One example
of a complex mixture of nucleic acid species is the mRNAs in one cell or tissue.

According to one embodiment of the invention, unit arrays of fixed probes allow
use of every possible immobilized probe with cocktails of a relatively small number of
labeled probes. More complex cocktails of labeled probes may be used if a multiplex
labeling scheme is implemented. Preferred multiplexing methods may use different
fluorescent dyes or molecular tags that may be separated by mass spectroscopy.

Alternatively, according to a preferred embodiment of the invention, relatively
short fixed probes may be selected which frequently hybridize to many nucleic acid
sequences. Such short probes are used in combination with a cocktail of labeled probes
which may be prepared such that at least one labeled probe corresponds to each of the
fixed probes. Preferred cocktails are those in which none of the labeled probes
corresponds to more than one fixed probe.

EXAMPLE 27 

Interrogation of Segments of the HIV Virus with All Possible 10-mers 
In this example of Format III SBH, an array was generated on nylon membranes
(e.g., Gene Screen) of all possible bound 5-mers (1024 possible pentamers). The bound
5-mer oligonucleotides were synthesized with 5' tails of 5'-TTTTTT-NNN-3' (N = all
four bases A, C, G, T; at this step in the synthesis equal molar amounts of all four bases
are added). These oligonucleotides were precisely spotted onto the nylon membrane, the
spots were allowed to dry, and the oligonucleotides were immobilized by treating the
dried spots with UV light. Oligonucleotide densities of up to 18 oligonucleotides per
square nanometer were obtained using this method. After the UV treatment, the nylon
membranes were treated with a detergent-containing buffer at 60-80°C. The spots of
oligonucleotides were gridded in subarrays of 10 by 10 spots, and each subarray has 64
5-mer spots and 36 control spots. 16 subarrays give 1024 5-mers, which encompasses all
possible 5-mers.

The subarrays in the array were partitioned from each other by physical barriers,
e.g., a hydrophobic strip, that allowed each subarray to be hybridized to a sample without
cross-contamination from adjacent subarrays. In a preferred embodiment, the
hydrophobic strip is made from a solution of silicone (e.g., household silicone glue and
seal paste) in an appropriate solvent (such solvents are well known in the art). This
silicone solution is applied between the subarrays to form lines which, after the
solvent evaporates, act as hydrophobic strips separating the cells.

In this Format III example, the free or solution (nonbound) 5-mers were
synthesized with 3' tails of 5'-NN-3' (N = all four bases A, C, G, T). In this
embodiment, the free 5-mers and the bound 5-mers are combined to produce all possible
10-mers for sequencing a known DNA sequence of less than 20 kb. 20 kb of
double-stranded DNA is denatured into 40 kb of single-stranded DNA. This 40 kb of ss DNA
hybridizes to about 4% of all possible 10-mers. This low frequency of 10-mer binding
and the known target sequence allow the pooling of free or solution (nonbound) 5-mers
for treatment of each subarray, without a loss of sequence information. In a preferred
embodiment, 16 probes are pooled for each subarray, and all possible 5-mers are
represented in 64 total pools of free 5-mers. Thus, all possible 10-mers may be probed
against a DNA sample using 1024 subarrays (16 subarrays for each pool of free 5-mers).
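
The counts in this paragraph follow from simple arithmetic, reproduced below as a Python sketch. The variable names are illustrative. The 4% figure is treated here as an upper bound, on the assumption that each of the roughly 40,000 single-stranded positions starts a distinct 10-mer.

```python
n_5mers = 4 ** 5                    # 1024 possible bound (or free) 5-mers
n_10mers = 4 ** 10                  # 1,048,576 possible 10-mers

spots_per_subarray = 64             # bound 5-mer spots per 10 x 10 subarray
subarrays_per_full_set = n_5mers // spots_per_subarray   # subarrays holding all bound 5-mers

probes_per_pool = 16                # free 5-mers pooled per subarray
n_pools = n_5mers // probes_per_pool                      # pools covering all free 5-mers

# Every pool of free 5-mers must be applied to a full set of bound 5-mers.
total_subarrays = subarrays_per_full_set * n_pools

single_stranded_bases = 2 * 20_000  # 20 kb duplex denatured into two strands
fraction_of_10mers = single_stranded_bases / n_10mers     # upper bound on 10-mers present

print(subarrays_per_full_set, n_pools, total_subarrays, round(fraction_of_10mers, 3))
```

The sketch gives 16 subarrays per complete bound set, 64 pools, 1024 subarrays in total, and a 10-mer occupancy of about 0.038, consistent with the "about 4%" stated above.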

The target DNA in this embodiment represents two 600 bp segments of the HIV
virus. These 600 bp segments are represented by pools of 60 overlapping 30-mers (the
30-mers overlap each adjacent 30-mer by 20 nucleotides). The pools of 30-mers mimic a
target DNA that has been treated using techniques well known in the art to shear, digest,
and/or randomly PCR-amplify the target DNA to produce a random pool of very small
fragments.

As described above in the previous Format III examples, the free 5-mers are
labeled with radioactive isotopes, biotin, fluorescent dyes, etc. The labeled free 5-mers
are then hybridized along with the bound 5-mers to the target DNA, and ligated. In a
preferred embodiment, 300-1000 units of ligase are added to the reaction. The
hybridization conditions were worked out following the teachings of the previous
examples. Following ligation and removal of the target DNA and excess free probe, the
array is assayed to determine the location of labeled probes (using the techniques
described in the examples above).

The known DNA sequence of the target, and the known free and bound 5-mers in
each subarray, predict which bound 5-mers will be ligated to a labeled free 5-mer in each
subarray. The signal from 20 of these predicted dots was lost and 20 new signals were
gained for each change in the target DNA from the predicted sequence. The overlapping
sequence of the bound 5-mers in these ten new dots identifies which free, labeled 5-mer is
bound in each new dot.
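
Why a single base change removes and creates a fixed number of signals can be illustrated with a short Python sketch. The reference sequence, the position of the change, and the function name `changed_windows` are invented for illustration. A substitution at an interior position alters every 10-base window that covers it, i.e. 10 windows on one strand; counting both strands gives 20, which is one plausible reading of the figure of 20 given above.

```python
def changed_windows(reference, variant, k=10):
    """Start positions of k-base windows that differ between two equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [i for i in range(len(reference) - k + 1)
            if reference[i:i + k] != variant[i:i + k]]


# Made-up 40-base reference; position 20 holds an 'A' that is changed to 'T'.
reference = "ACGT" * 10
pos = 20
variant = reference[:pos] + "T" + reference[pos + 1:]

changed = changed_windows(reference, variant)
print(len(changed), changed[0], changed[-1])
```

Each changed window corresponds to one predicted 10-mer that is lost and one new 10-mer that appears in its place, so the pattern of lost and gained dots localizes the change.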

Using the described methods, arrays and pools of free, labeled 5-mers, the test
HIV DNA sequence was probed with all possible 10-mers. Using this Format III
approach, we properly identified the "wild-type" sequence of the segments tested, as well
as several sequence "mutants" that were introduced into these segments.

EXAMPLE 28

Sequencing of Repetitive DNA Sequences 

In one embodiment, repetitive DNA sequences in the target DNA are sequenced
with "spacer oligonucleotides" in a modified Format III approach. Spacer
oligonucleotides of varying lengths of the repetitive DNA sequence (the repeating
sequence is identified on a first SBH run) are hybridized to the target DNA along with a
first known adjoining oligonucleotide and a second known oligonucleotide, or group of
possible oligonucleotides, adjoining the other side of the spacer (known from the first
SBH run). When a spacer matching the length of the repetitive DNA segment is hybridized
to the target, the two adjacent oligonucleotides can be ligated to the spacer. If the first
known oligonucleotide is fixed to a substrate, and the second known or possible
oligonucleotide(s) is labeled, a bound ligation product including the labeled second known
or possible oligonucleotide(s) is formed when a spacer of the proper length is hybridized
to the target DNA.
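
The spacer test can be modeled as a search over repeat counts. In the Python sketch below, substring matching on a single strand stands in for hybridization and ligation, which is a simplifying assumption; the target sequence, the flanking oligonucleotides, the repeat unit, and the function name `matching_spacer_length` are all hypothetical.

```python
def matching_spacer_length(target, left, right, unit, max_units=20):
    """Return the repeat count n for which left + unit*n + right occurs in target.

    A contiguous match models the case in which the fixed oligonucleotide (left),
    the spacer (unit repeated n times) and the labeled oligonucleotide (right)
    abut one another on the target and can therefore be ligated.
    Returns None if no tested spacer length fits.
    """
    for n in range(1, max_units + 1):
        if left + unit * n + right in target:
            return n
    return None


# Invented target: seven copies of CAG between two unique flanks.
target = "TTGACCA" + "CAG" * 7 + "GTTCAAG"
n = matching_spacer_length(target, left="GACCA", right="GTTCA", unit="CAG")
print(n)
```

Only the spacer whose length equals the repeat tract brings both flanking oligonucleotides into adjacency, so in this sketch only one repeat count yields a ligation product.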

EXAMPLE 29 

Sequencing Through Branch Points with Format III SBH 

In one embodiment, branch points in the target DNA are sequenced using a third
set of oligonucleotides and a modified Format III approach. After a first SBH run,
several branch points may be identified when the sequence is compiled. These can be
solved by hybridizing oligonucleotide(s) that overlap partially with one of the known
sequences leading into the branch point and then hybridizing to the target an additional
oligonucleotide that is labeled and corresponds to one of the sequences that comes out of
the branch point. When the proper oligonucleotides are hybridized to the target DNA,
the labeled oligonucleotide can be ligated to the other(s). In a preferred embodiment, a
first oligonucleotide that is offset by one to several nucleotides from the branch point is
selected (so that it reads into one of the branch sequences), a second oligonucleotide
reading from the first and into the branch point sequence is also selected, and a set of
third oligonucleotides that correspond to all the possible branch sequences with an overlap
of the branch point sequence by one or a few nucleotides (corresponding to the first
oligonucleotide) is selected. These oligonucleotides are hybridized to the target DNA,
and only the third oligonucleotide with the proper branch sequence (that matches the
branch sequence of the first oligonucleotide) will produce a ligation product with the first
and second oligonucleotides.
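
How branch points arise when a sequence is compiled from overlapping probes can be shown with a minimal Python sketch. It assumes, for illustration only, that assembly proceeds by extending through (k-1)-base overlaps; the example sequence and the function names `kmers_of` and `branch_points` are hypothetical. A branch point is any overlap that can be followed by more than one base.

```python
from collections import defaultdict


def kmers_of(seq, k):
    """Set of all k-base subsequences of seq (the positive probes in an ideal experiment)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def branch_points(kmers):
    """Map each (k-1)-base overlap to its possible next bases; keep ambiguous ones only."""
    successors = defaultdict(set)
    for kmer in kmers:
        successors[kmer[:-1]].add(kmer[-1])
    return {overlap: sorted(bases)
            for overlap, bases in successors.items() if len(bases) > 1}


# Invented sequence in which ACGT occurs twice with different continuations.
seq = "ACGTTGCAACGTAGGC"
ambiguous = branch_points(kmers_of(seq, 5))
print(ambiguous)
```

In this sketch the overlap ACGT can be extended by either A or T, so the 5-mer data alone cannot order the two branches; the additional ligation step described in this Example supplies the missing linkage.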

EXAMPLE 30 

Multiplexing Probes for Analyzing a Target Nucleic Acid

In this Example, sets of probes are labeled with different labels so that each probe
of a set can be differentiated from the other probes in the set. Thus, the set of probes
may be contacted with target nucleic acid in a single hybridization reaction without the
loss of any probe information. In preferred embodiments, the different labels are
different radioisotopes, different fluorescent labels, or different electrophore mass
labels (EMLs). These sets of probes may be used in Format I, Format II or Format III SBH.

In Format I SBH, the set of differently labeled probes is hybridized to target
nucleic acid which is fixed to a substrate under conditions that allow differentiation
between perfect matches and one base-pair mismatches. Specific probes which bind to the
target nucleic acid are identified by their different labels and perfect matches are
determined, at least in part, from this binding information.

In Format II SBH, the target nucleic acids are labeled with different labels and
hybridized to arrays of probes. Specific target nucleic acids which bind to the probes are
identified by their different labels and perfect matches are determined, at least in part,
from this binding information.

In Format III SBH, the set of differently labeled probes and fixed probes are
hybridized to a target nucleic acid under conditions that allow perfect matches to be
differentiated from one base-pair mismatches. Labeled probes that are adjacent, on the
target, to a fixed probe are bound to the fixed probe, and these products are detected and
differentiated by their different labels.

In a preferred embodiment, the different labels are EMLs, which can be detected
by electron capture mass spectrometry (EC-MS). EMLs may be prepared from a variety
of backbone molecules, with certain aromatic backbones being particularly preferred,
e.g., see Xu et al., J. Chromatog. 764:95-102 (1997). The EML is attached to a probe
in a reversible and stable manner, and after the probe is hybridized to target nucleic acid,
the EML is removed from the probe and identified by standard EC-MS (e.g., the EC-MS
may be done by a gas chromatograph-mass spectrometer).

EXAMPLE 31

Detection of Low Frequency Target Nucleic Acids 

Format III SBH has sufficient discrimination power to identify a sequence that is
present in a sample at 1 part to 99 parts of a similar sequence that differs by a single
nucleotide. Thus, Format III can be used to identify a nucleic acid present at a very low
concentration in a sample of nucleic acids, e.g., a sample derived from blood.

In one embodiment, the two sequences are for cystic fibrosis and the sequences
differ from each other by a deletion of three nucleotides. Probes for the two sequences
were as follows: probes distinguishing the deletion from wild type were fixed to a
substrate, and a labeled contiguous probe was common to both. Using these targets and
probes, the deletion mutant could be detected with Format III SBH when it was present at
one part to ninety-nine parts of the wild type.

EXAMPLE 32 

Polaroid Apparatus and Method for Analyzing a Target Nucleic Acid

An apparatus for analyzing a nucleic acid can be constructed with two arrays of
nucleic acids, and an optional material that prevents the nucleic acids of the two arrays
from mixing until such mixing is desired. The arrays of the apparatus may be supported
by a variety of substrates, including but not limited to nylon membranes, nitrocellulose
membranes, or other materials disclosed above. In preferred embodiments, one of the
substrates is a membrane separated into sectors by hydrophobic strips, or a suitable
support material with wells which may contain a gel or sponge. In this embodiment,
probes are placed on a sector of the membrane, or in the well, the gel, or sponge, and a
solution (with or without target nucleic acids) is added to the membrane or well so that
the probes are solubilized. The solution with the solubilized probes is then allowed to
contact the second array of nucleic acids. The nucleic acids may be, but are not limited
to, oligonucleotide probes or target nucleic acids, and the probes or target nucleic acids
may be labeled. The nucleic acids may be labeled with any labels conventionally used in
the art, including but not limited to radioisotopes, fluorescent labels or electrophore mass
labels.

The material which prevents mixing of the nucleic acids may be disposed between
the two arrays in such a way that when the material is removed the nucleic acids of the
two arrays mix together. This material may be in the form of a sheet, membrane, or
other barrier, and may be composed of any material that prevents the mixing
of the nucleic acids.

This apparatus may be used in Format I SBH as follows: a first array of the
apparatus has target nucleic acids that are fixed to the substrate, and a second array of the
apparatus has nucleic acid probes that are labeled and can be removed to interrogate the
target nucleic acid of the first array. The two arrays are optionally separated by a sheet
of material that prevents the probes from contacting the target nucleic acid, and when this
sheet is removed the probes can interrogate the target. After appropriate incubation and
(optionally) washing steps, the array of targets may be "read" to determine which probes
formed perfect matches with the target. This reading may be automated or can be done
manually (e.g., by eye with an autoradiogram). In Format II SBH, the procedure
followed would be similar to that described above except that the target is labeled and the
probes are fixed.

Alternatively, the apparatus may be used in Format III SBH as follows: two arrays
of nucleic acid probes are formed, the nucleic acid probes of either or both arrays may be
labeled, and one of the arrays may be fixed to its substrate. The two arrays are separated
by a sheet of material that prevents the probes from mixing. A Format III reaction is
initiated by adding target nucleic acid and removing the sheet, allowing the probes to mix
with each other and the target. Probes which bind to adjacent sites on the target are
bound together (e.g., by base-stacking interactions or by covalently joining the
backbones), and the results are read to determine which probes bound to the target at
adjacent sites. When one set of probes is fixed to the substrate, the fixed array can be
read to determine which probes from the other array are bound together with the fixed
probes. As with the above method, this reading may be automated (e.g., with an ELISA
reader) or can be done manually (e.g., by eye with an autoradiogram).

EXAMPLE 33 

Three Dimensional Arrays Of Probes 
In a preferred embodiment, the oligonucleotide probes are fixed in a
three-dimensional array. The three-dimensional array is comprised of multiple layers, such
that each layer may be analyzed separate and apart from the other layers, or all the layers
of the three-dimensional array may be simultaneously analyzed. Three-dimensional arrays
include, for example, an array disposed on a substrate having multiple depressions with
probes located at different depths within the depressions (each level is made up of probes
at similar depths within the depression); or an array disposed on a substrate having
depressions of different depths with the probes located at the bottom of the depression, at
the peaks separating the depressions or some combination of peaks and depressions (each
level is made up of all probes at a certain depth); or an array disposed on a substrate
comprised of multiple sheets that are layered to form a three-dimensional array.

Materials for synthesizing these three-dimensional arrays are well known in the
art, and include the materials previously recited in this specification as suitable
supports for probe arrays. In addition, other suitable materials which can support
oligonucleotide probes, and which are preferably flexible, may be used as substrates.



EXAMPLE 34 

Signature Processing For Clustering cDNA Clones

A plurality of distinct nucleic acid sequences were obtained from a cDNA library
using standard PCR, SBH sequence signature analysis and Sanger sequencing techniques.
The inserts of the library were amplified with PCR using primers specific for vector
sequences which flank the inserts. These samples were spotted onto nylon membranes
and interrogated with a suitable number of oligonucleotide probes, and the intensity of
positive binding probes was measured, giving sequence signatures. The clones were
clustered into groups of similar or identical sequence signatures, and single representative
clones were selected from each group for gel sequencing. The 5' sequence of the
amplified inserts was then deduced using the reverse M13 sequencing primer in a typical
Sanger sequencing protocol. PCR products were purified and subjected to fluorescent dye
terminator cycle sequencing. Single pass gel sequencing was done using a 377 Applied
Biosystems (ABI) sequencer. The majority of clones which were selected and sequenced
by this method had sequences which differed from each other, and a very small number
had the same sequence.
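
Grouping clones by signature similarity can be sketched as follows. The specification does not state which similarity measure or clustering procedure was used, so the Pearson correlation, the greedy assignment, the 0.9 threshold, and the example intensities below are all assumptions made for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length intensity vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def cluster_signatures(signatures, threshold=0.9):
    """Greedy clustering: each clone joins the first cluster whose representative
    it matches at or above the threshold; otherwise it founds a new cluster.

    Returns a list of (representative, members) tuples.
    """
    clusters = []
    for name, sig in signatures.items():
        for representative, members in clusters:
            if pearson(sig, signatures[representative]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, [name]))
    return clusters


# Hypothetical hybridization intensities for five probes per clone.
signatures = {
    "clone1": [9.0, 1.0, 8.0, 0.5, 7.5],
    "clone2": [8.5, 1.2, 8.2, 0.4, 7.0],
    "clone3": [0.6, 9.1, 1.0, 8.7, 0.9],
}
clusters = cluster_signatures(signatures)
print(clusters)
```

The first member of each cluster plays the role of the single representative clone that would be sent for gel sequencing.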

EXAMPLE 35

High-Throughput Production Of Chips 

In a preferred embodiment, an apparatus for mass producing arrays of probes may
comprise a rotating drum or plate coupled with an ink-jet deposition apparatus, for
example, a microdrop dosing head; and a suitable robotics system, for example, an
Anorad gantry. A particularly preferred embodiment of the apparatus will be described
referring to Figs. 1-3.

The apparatus comprises a cylinder (1) to which a suitable substrate is fixed. The
substrate may be any of the materials previously described as suitable for an array of
probes. In a preferred embodiment, the substrate is a flexible material, and the arrays are
made directly on the substrate. In alternative embodiments, a flexible substrate is fixed to
the cylinder and individual chips are fixed on the substrate. The arrays are then made on
each individual chip.

In a preferred embodiment, physical barriers are applied to the substrate or chip
and define an array of wells. The physical barriers may be applied to the substrate or chip
by the apparatus, or alternatively, the physical barriers are applied to the chips or
substrate before they are fixed to the cylinder (1). A single spot of oligonucleotide
probes is then placed into each well, wherein the probes placed into an individual well
may all have the same sequence, or the probes spotted into an individual well may have
different sequences. In a more preferred embodiment, the probe or probes spotted into
each individual well in an array are different from the probe or probes spotted in the
other wells of the array. Sequencing chips comprising multiple arrays can then be
assembled from these arrays.

After the substrate or substrate and chips are fixed to the cylinder (1), a motor
(not shown) rotates the cylinder. The cylinder's rotation speed is precisely determined by
any of the ways well known in the art, including, for example, using a fixed optical sensor
and a light source that rotates with the cylinder. A dispensing apparatus (3) moves along
an arm (2) and can deliver probes or other reagents through a dispensing tip (8) to precise
locations on the substrate or chips using the precise rotation speed calculated above, by
methods well known in the art. The dispensing apparatus receives probes or reagents
from the reservoir (6) through the feeding line (7). The reservoir (6) holds all the
necessary probes and other reagents for making the arrays.

The dispensing apparatus is depicted in Figure 3. The dispensing apparatus may
have one or multiple dispensing tips (14 & 8). Each dispensing tip has a sample well
(13) in a body (12) that receives probes or other reagents through a sample line (10).
The pressure line (11) pressurizes the chamber (9) to a pressure sufficient to force probes
or reagents through the dispensing tip (14 & 8). The sample line (10), well (13) and
dispensing tip (14 & 8) must be flushed between each change in probe or reagent. An
appropriate washing buffer is supplied through the sample line (10) or through an optional
dedicated washing line (not shown) to the sample well (13), or optionally a portion or all
of the chamber (9) may be filled with washing buffer. The washing buffer is then
removed from the sample well (13) and chamber (9), if necessary, by an evacuation line
(not shown) or through the sample line (10) and dispensing tip (14 & 8).

When the dispensing means has applied probes to all the appropriate sites in each 
array or chip, the substrate (with or without chips) is removed from the cylinder and a 
new substrate is fixed to the cylinder. 

EXAMPLE 36 

Analysis Of A Target Nucleic Acid With Probes Complexed To Discrete Particles 

In this embodiment, a target nucleic acid is interrogated with probes that are
complexed (covalently or noncovalently) to a plurality of discrete particles. The discrete
particles can be discriminated from each other based on a physical property (or a
combination of physical properties), and particles differentiated by the physical
property are complexed with different probes. In a preferred embodiment, the probe is
an oligonucleotide of a known sequence and length. Thus, a probe may be identified by
the physical property of the discrete particle. Suitable probes for this embodiment include
all the probes that are described above in previous sections, including probes which are
shorter in an informative sense than the probe's full length.

The physical property of the discrete particle may be any property, well known in
the art, which allows particles to be differentiated into sets. For example, the particles
could be differentiated into sets based on their size, fluorescence, absorbance,
electromagnetic charge, or weight, or the particles could be labeled with dyes,
radionuclides, or EMLs. Other suitable labels include ligands which can serve as specific
binding members to a labeled antibody, chemiluminescers, enzymes, antibodies which can
serve as a specific binding pair member for a labeled ligand, and the like. A wide variety
of labels have been employed in immunoassays and can readily be employed here. Still
other labels include antigens, groups with specific reactivity, and electrochemically
detectable moieties. Still further labels include any of the labels recited above in
previous sections. These labels and properties may be measured quantitatively by
methods well known in the art, including, for example, those methods described above in
previous sections, and the particles may be differentiated on the basis of signal intensity
or signal type (for one of the labels, e.g., different dye densities may be applied to a
particle, or different types of dyes). In a preferred embodiment, several physical
properties are combined and the different combinations of properties allow discrimination
of the particles (e.g., ten sizes and ten colors could be combined to differentiate 100
particle groups).

The particle-probes allow the exploitation of standard combinatorial approaches so
that, for example, all possible 10-mers can be synthesized using about 2000 reaction
containers. A first set of 1024 reactions is done to synthesize all possible 5-mers on
1024 differentially labeled particles. The resulting probe-particles are mixed together,
and split into another set of 1024 reaction containers. A second set of reactions is done
with these samples to synthesize all possible 5-mer extensions on the probes in the pools
of particles. The physical property identifies the first five nucleotides of each probe and
the reaction container identifies the second five nucleotides of every
probe. Thus, all possible 10-mer probes are synthesized using 2048 reaction containers.
This approach is easily modified to make all possible n-mers for a large range of probe
lengths.
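
The saving from the two-round scheme can be verified with a small Python sketch. The function names are illustrative, and the simulation uses 2-mers extended to 4-mers so that it stays small; the same counting applies to 5-mers extended to 10-mers.

```python
from itertools import product

BASES = "ACGT"


def containers_for_split_pool(n, rounds=2):
    """Reaction containers needed to build all n-mers in equal split-and-pool rounds."""
    if n % rounds:
        raise ValueError("this sketch assumes n is divisible by the number of rounds")
    return rounds * 4 ** (n // rounds)


def split_pool(k):
    """Simulate two rounds that build all 2k-mers.

    Round 1 puts each k-mer on its own distinguishable particle type.
    Round 2 splits the pooled particles into one container per k-mer extension.
    Returns {(particle_label, container_label): full_probe_sequence}.
    """
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    probes = {}
    for container in kmers:        # extension added in this container
        for particle in kmers:     # every particle type is present in every container
            probes[(particle, container)] = particle + container
    return probes


small = split_pool(2)
print(containers_for_split_pool(10), 4 ** 10, len(set(small.values())))
```

With these assumptions, 2048 containers suffice for all 1,048,576 possible 10-mers, and each probe is identified by the pair (particle label, container), as described above.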

In a preferred embodiment, the particles are separated into sets by the intensity of
fluorescence of the particles. The particles in each set are prepared with varying densities
of fluorescent label, and thus, the particles have different fluorescence intensities. The
fluorescent intensity of fluorescein is related to concentration over a range of 1:300 to
1:300,000 (Lockhart et al., 1986), and between 1:3000 and 1:300,000 there is a linear
relationship (so the fluorescein intensity is linear over a range of about 1-300). In the
linear range of detection, 256 sets of particles are labeled with fluorescein (e.g., 3-259).
256 sets of particles allow all possible 4-mers to be attached to different sets of particles.
By pooling the particles, all possible 5-mers can be made by having four pools of all
possible 4-mers and then extending the probes in each pool by A, G, C, or T. Similarly,
all possible 6-mers can be made by having 16 pools of all possible 4-mers in which each
pool of 4-mers is extended by one of the 16 possible two-base permutations of A, G, C,
and T (and so on: for 7-mers there are 64 pools, for 8-mers there are 256 pools, and so forth).
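
The pool counts quoted above follow from simple powers of four, as the following sketch shows. The function names and the `coded_length` parameter are hypothetical; the sketch assumes that the particle intensity encodes the first four bases and that each pool supplies one fixed extension.

```python
def particle_sets_needed(coded_length=4):
    """Number of distinguishable particle sets needed to encode every
    sequence of `coded_length` bases (256 for 4-mers)."""
    return 4 ** coded_length


def pools_needed(n, coded_length=4):
    """Number of pools needed to make all n-mers when the first
    `coded_length` bases are encoded by the particle and the remaining
    bases are fixed by the pool."""
    if n < coded_length:
        raise ValueError("probe length must be at least the coded length")
    return 4 ** (n - coded_length)


pool_counts = {n: pools_needed(n) for n in range(5, 9)}
```

Here `pool_counts` holds 4, 16, 64, and 256 pools for 5-, 6-, 7-, and 8-mers respectively, matching the figures in the text.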

The 5-mer probes (in four pools) are used to interrogate a target nucleic acid. The
target nucleic acid is labeled with another fluorescent dye, or other different label (as
described above). Labeled target is mixed with the four pools, and complementary
probes in each pool hybridize to the target nucleic acid. These hybridization complexes
are detected by methods well-known in the art, and the positively hybridizing probes are
then identified by detecting the fluorescence intensity of the particle. In a most preferred
embodiment, the mixture of probe-particles and target nucleic acid is fed through a
flow-cytometer or other separating instrument one particle at a time, and the particle label
and the target label are measured to determine which probes are complementary to the target
nucleic acid.
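
One way to picture the read-out described above is as a decoding step applied to each particle event. The sketch below is a minimal illustration under stated assumptions: the intensity index (0-255), the event tuples, and the signal threshold are all hypothetical and do not come from the specification, which does not prescribe a data format or a threshold.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical mapping: intensity index 0..255 -> one of the 256 possible 4-mers.
FOUR_MERS = ["".join(p) for p in product(BASES, repeat=4)]


def decode_event(pool_base, intensity_index):
    """Reconstruct the 5-mer probe on a particle from its pool (the
    extension base) and its particle intensity index (the 4-mer)."""
    return FOUR_MERS[intensity_index] + pool_base


def positive_probes(events, threshold):
    """Return the sorted probes whose target-label signal meets the threshold.

    Each event is (pool_base, intensity_index, target_signal).
    """
    hits = set()
    for pool_base, intensity_index, target_signal in events:
        if target_signal >= threshold:
            hits.add(decode_event(pool_base, intensity_index))
    return sorted(hits)


# Made-up events for illustration only.
events = [("A", 0, 950.0), ("G", 27, 12.0), ("T", 255, 800.0)]
hits = positive_probes(events, threshold=100.0)
```

In this made-up run, the first and third events exceed the threshold and decode to the probes AAAAA and TTTTT; the second event decodes to ACGTG but is rejected as background.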

In an alternative embodiment, a set of free probes is labeled with another
fluorescent dye, or other label (as described above), and individual free probes are mixed
with each pool of 5-mer probes (four pools) and then the mixtures are hybridized with the
target nucleic acid. An agent is added to covalently attach free probe to 5-mer probe (see
previous sections for a description of suitable agents), when the free probe is bound to a
site on the target nucleic acid that is adjacent to the site on which the 5-mer probe is
bound (the free probe site must be adjacent to the end of the 5-mer probe which can be
ligated). The particles are then assayed, by methods well known in the art, to determine
which particles have been covalently coupled to the free probe, i.e., the particles which
have the free probe label, and the 5-mer probe is identified by the fluorescent intensity of
the particle. In a most preferred embodiment, the mixture of probe-particles, free probes,
and target nucleic acid is fed through a flow-cytometer one particle at a time, and the
particle label and the free probe label are measured to determine which probes are
complementary to the target nucleic acid.

In a preferred embodiment, a single apparatus houses all or most of the
manipulations for an analysis of a target nucleic acid with the probe-particle complexes.
The apparatus has one or more reagent chambers in which buffer and labeled target
nucleic acid are thoroughly mixed (target nucleic acid may be added manually or
automatically). The mixture is aliquoted from the reagent chamber into a plurality of
reaction chambers, and each reaction chamber has a pool of probe-particle complexes.
The probe-particles and target nucleic acid react under conditions which allow
complementary probes to bind with the target nucleic acid. Excess target nucleic acid,
i.e., nonbound, is removed from the reaction chamber (e.g., by washing), and the
particles bound to target nucleic acid are identified by the association of target nucleic
acid label with the particles, and the probe is identified by the physical property of the
particle. In a preferred embodiment, after removing excess target, the particles move
single file through a channel from the reaction chamber to the detecting device(s). As
single particles move past the detecting device(s), the device(s) measure the target label and the
physical property of the particle. In an alternative preferred embodiment, before or after
removing excess target, the particles are fractionated, for example, by size (e.g.,
exclusion chromatography), charge (e.g., ion exchange chromatography), and/or
density/weight into their sets using one or a combination of these physical properties. These
fractionated particles are then assayed by the detecting device(s).

In an alternative of this embodiment, the main reagent chamber is supplied with
buffer, target nucleic acid, a pool of probe-particle complexes, and a chemical or
enzymatic ligating reagent. These components are thoroughly mixed and then aliquoted
from the reagent chamber into a plurality of reaction chambers. Each of the reaction
chambers has a labeled free probe. Alternatively, the pool of probe-particle complexes
may be placed in the reaction chamber with the free probe instead of adding them to the
reagent chamber. Additionally, the free probes could be added to the reagent chamber,
and the pool of probe-particles could be added to the reaction chamber. The
probe-particles, target nucleic acid, and free probe react under conditions which allow free- and
particle-probes to bind with adjacent sites on the target nucleic acid so that free probe is
ligated to the probe-particle. Excess free probe (i.e., nonligated) and target nucleic acid
are removed from the reaction chamber (e.g., by washing), and the ligated probes are
identified by the association of free probe label with the particles, and the probes
complexed to the particles are identified by the physical property of the particle. In a
preferred embodiment, after removing excess probe and the target, the particles move
single file through a channel from the reaction chamber to the detecting device(s). As
single particles move past the detecting device(s), the device(s) measure the free probe label covalently
attached to the particles, and the physical property of the particle. In an alternative
preferred embodiment, before or after removing excess probe and the target, the particles
are fractionated, for example, by size (e.g., exclusion chromatography), charge (e.g., ion
exchange chromatography), and/or density/weight into their sets using the physical
property. These fractionated particles are assayed by the detecting device(s).

In a preferred embodiment, there is a second set of reaction chambers in the
apparatus, and the pool of probe-particles is placed in the second reaction chamber.
Target and buffer are mixed in the reagent chamber and these are fed into the first
reaction chamber which contains the labeled free probe. The probe and target are mixed,
and optionally the probe may hybridize to the target. This mixture of labeled probe and
target is then passed to the second reaction chamber which contains the pool of
probe-particles. The free probe and probe-particles hybridize to target and appropriate probes
are ligated in the second reaction chamber. The ligating agent may be added at the
reagent chamber or in either reaction chamber; preferably, the ligating agent is added in
the second reaction chamber. The probe-particle hybridization products in the second
reaction chamber are analyzed as above.

In one embodiment, the target nucleic acid is not amplified prior to analysis (either
by PCR or in a vector, e.g., a lambda library). Preferably longer free and particle
probes are used in this embodiment because of the increase in sequence complexity of the
sample (i.e., to distinguish positives over background).

The probe-particle embodiments described in this example are suitable for use in
any of the applications previously described, including, but not limited to, the previously
described diagnostic and sequencing applications. Additionally, these probe-particle
embodiments may be modified by any of the previously described variations or
modifications.

EXAMPLE 37 

The Interaction of Complementary Polynucleotides in the Presence of 
Agents Which Modify the Binding Between the Polynucleotides. 

In this embodiment, the discrimination of perfect matches from mismatches in the
binding of complementary polynucleotides is modulated by the addition of an agent or
agents. In a preferred embodiment, the complementary polynucleotides are a target
polynucleotide and a polynucleotide probe. The discrimination of perfect matches from
mismatches may be modulated by adding an agent wherein the agent is a salt such as
trialkyl ammonium salt (e.g., TMAC, Ricelli et al., Nucl. Acids Res. 21:3785-3788
(1993)), sodium chloride, phosphate salts, and borate salts; organic solvents such as
formamide, glycol, dimethylsulfoxide, and dimethylformamide; urea; guanidinium; amino
acid analogs such as betaine (Henke et al., Nucl. Acids Res. 19:3957-3958 (1997); Rees
et al., Biochemistry 32:137-144 (1993)); polyamines such as spermidine and spermine
(Thomas et al., Nucl. Acids Res. 25:2396-2402 (1997)), or other positively charged
molecules which neutralize the negative charge of the phosphate backbone; detergents
such as sodium dodecyl sulfate and sodium lauryl sarcosinate; minor/major groove
binding agents; positively charged polypeptides; and intercalating agents such as acridine,
ethidium bromide, and anthracene. In a preferred embodiment, a mixture of agents is
added to the hybridization reaction to modulate the discrimination of perfect matches from
mismatches. Some of these agents effect discrimination by reducing the entropy of
melting between two complementary strands.

In a preferred embodiment, the discrimination of perfect matches from mismatches
is improved by the agent or agents. For example, formamide, a commonly used
denaturing agent, has been shown to preferentially destabilize mismatches versus perfect
matches in a format III reaction. As described above, a format III reaction was set up,
and then varying amounts of formamide were added (0%, 10%, 20%, 30%, 40%, and
50%). At 0% a perfect match signal was detected and the background (mismatches) was
high. At 10% formamide, there was a good perfect match signal and the
background/mismatch signal was reduced. At 20% formamide, the perfect match signal
was reduced (but detectable) and the background/mismatch signal was eliminated. At
30%-50% formamide there was no perfect match or background/mismatch signal.

In an alternative embodiment, an agent is used to reduce or increase the Tm of a
pair of complementary polynucleotides. In a more preferred embodiment, a mixture of
the agents is used to reduce or increase the Tm of a pair of complementary
polynucleotides. The agents may alter the Tm in a number of ways; two examples, which
are not meant to limit the invention, are (1) agents which disrupt the hydrogen bonding
between the bases of two complementary polynucleotides (Goodman, Proc. Nat'l Acad.
Sci. 94:10493-10495 (1997); Moran et al., Proc. Nat'l Acad. Sci. 94:10506-10511
(1997); Nguyen et al., Nucl. Acids Res. 25:3059-3065 (1997)), and (2) agents which
neutralize or shield the negative charges of the phosphates in the sugar phosphate
backbone of the polynucleotide (Thomas et al., Nucl. Acids Res. 25:2396-2402 (1997)).
By strengthening or weakening (1) and/or (2) one can modulate the Tm of a
complementary pair of polynucleotides.

In a most preferred embodiment, an agent or agents are added to decrease the
binding energy of GC base pairs, or increase the binding energy of AT base pairs, or
both. In a preferred embodiment, the agent or agents are added so that the binding
energy from an AT base pair is approximately equivalent to the binding energy of a GC
base pair. Thus, the energy of binding between two complementary polynucleotides is
solely dependent on length. The energy of binding of these complementary
polynucleotides may be increased by adding an agent that neutralizes or shields the
negative charges of the phosphate groups in the polynucleotide backbone.
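
The statement that binding energy becomes solely dependent on length can be illustrated with a deliberately simplified additive model. This is a toy sketch under stated assumptions: the per-base-pair energies are hypothetical values in arbitrary units, and nearest-neighbor and other sequence-context effects are ignored.

```python
def binding_energy(seq, e_at, e_gc):
    """Sum a per-base-pair energy over a probe sequence.

    Toy additive model: every A or T contributes e_at and every G or C
    contributes e_gc (arbitrary units, hypothetical values).
    """
    seq = seq.upper()
    n_at = sum(seq.count(base) for base in "AT")
    n_gc = sum(seq.count(base) for base in "GC")
    if n_at + n_gc != len(seq):
        raise ValueError("sequence must contain only A, C, G, T")
    return n_at * e_at + n_gc * e_gc


# Without an equalizing agent (GC pairs assumed stronger than AT pairs):
at_rich = binding_energy("AATTAATT", e_at=2, e_gc=3)
gc_rich = binding_energy("GGCCGGCC", e_at=2, e_gc=3)

# With AT and GC contributions made equal, only length matters:
at_rich_equalized = binding_energy("AATTAATT", e_at=3, e_gc=3)
gc_rich_equalized = binding_energy("GGCCGGCC", e_at=3, e_gc=3)
```

In this model the two 8-mers differ in energy when the AT and GC contributions differ, and become identical once the contributions are equalized, which is the behavior the paragraph describes.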

The present invention is not to be limited in scope by the exemplified
embodiments which are intended as illustrations of single aspects of the invention, and
compositions and methods which are functionally equivalent are within the scope of the
invention. Indeed, numerous modifications and variations in the practice of the invention
are expected to occur to those skilled in the art upon consideration of the present
preferred embodiments. Consequently, the only limitations which should be placed upon
the scope of the invention are those which appear in the appended claims.

All references cited within the body of the instant specification are hereby
incorporated by reference in their entirety.

CLAIMS



WHAT IS CLAIMED IS: 



1. An array of a plurality of oligonucleotide probes, comprising:
a substrate;

a material that forms a physical barrier, wherein the material is disposed on the
substrate to form a grid having a plurality of wells; and

wherein the plurality of oligonucleotide probes are arranged to form an array in
the plurality of wells, wherein each well contains one spot of probes fixed on the
substrate.






2. The array of Claim 1, wherein each individual spot has probes with 
sequences that are different from the other probes at the other spots in the array. 

3. The array of Claim 2, wherein the probes in each individual spot have the 
same sequence. 

4. The array of Claim 1, wherein the distance between a center of one spot
and a center of an adjacent spot is at least 325 µm.

5. A sequencing chip comprising a plurality of arrays of Claim 1. 



6. A sequencing chip comprising a plurality of arrays of Claim 2. 



7. A sequencing chip comprising a plurality of arrays of Claim 3.

8. The array of Claim 1, wherein the oligonucleotide probe comprises an 
informational portion and a reactive group wherein the reactive group is used to attach the 
probe to the substrate. 


9. The array of Claim 8, wherein the oligonucleotide probe further comprises
at least one randomized position.

10. The array of Claim 9, wherein the oligonucleotide probe further comprises
a spacer.

11. An array of a plurality of oligonucleotide probes, comprising a substrate
having a plurality of levels, wherein the plurality of probes are fixed to the plurality of
levels in the substrate.

12. The array of Claim 11, wherein each level may be analyzed individually. 

13. The array of Claim 11, wherein the plurality of levels may be analyzed
simultaneously.

14. The array of Claim 11, wherein the oligonucleotide probe comprises an
informational portion and a reactive group wherein the reactive group is used to attach the
probe to the substrate.

15. The array of Claim 14, wherein the oligonucleotide probe further 
comprises at least one randomized position. 

16. The array of Claim 15, wherein the oligonucleotide probe further
comprises a spacer.

17. A method for analyzing a target nucleic acid, comprising the steps of:

contacting the target nucleic acid with a plurality of oligonucleotide probes,
wherein the probes are complexed with a plurality of different discrete particles that can
be discriminated from one another based on a physical property and wherein a different
probe is complexed with each type of discrete particle;

detecting those probes that are complementary to the target nucleic acid; and

analyzing the target nucleic acid from a set of complementary probes.

18. The method of Claim 17, further comprising the step of separating the
discrete particles into fractions, wherein the discrete particles are separated on the basis of
the physical property.

19. The method of Claim 18, wherein the set of complementary probes has at
least two probes which overlap.

20. The method of Claim 18, wherein a sequence of the target nucleic acid is 
compiled in the analyzing step. 


21. The method of Claim 18, wherein the complementary probes are identified 
by the physical property of the discrete particle. 

22. The method of Claim 18, wherein the physical property is
associated with a molecule selected from the group consisting of a dye, a radionucleotide,
an EML, and a fluorescent molecule.

23. The method of Claim 18, wherein the physical property is selected from 
the group comprising a size, a charge, an absorbance, and a weight. 


24. The method of Claim 23, wherein the physical property is associated with 
an intensity of the physical property. 

25. The method of Claim 23, wherein the physical property is associated with a
plurality of different molecules.

26. The method of Claim 18, wherein an informative portion of the probes is
shorter than the probes' full length.

27. The method of Claim 18, wherein the target nucleic acid contains a label,
and the complementary probes are detected by the label on the target nucleic acid.

28. The method of Claim 17, wherein the detecting step is performed on
individual discrete particles by passing the individual particles past a detector.

29. The method of Claim 28, wherein the set of complementary probes has at
least two probes which overlap.



30. The method of Claim 28, wherein a sequence of the target nucleic acid is 
compiled in the analyzing step. 

31. The method of Claim 28, wherein the complementary probes are identified
by the physical property of the discrete particle.

32. The method of Claim 28, wherein the physical property is
associated with a molecule selected from the group consisting of a dye, a radionucleotide,
an EML, and a fluorescent molecule.

33. The method of Claim 28, wherein the physical property is selected from 
the group comprising a size, a charge, an absorbance, and a weight. 

34. The method of Claim 33, wherein the physical property is associated with
an intensity of the physical property.

35. The method of Claim 33, wherein the physical property is associated with a 
plurality of different molecules. 


36. The method of Claim 28, wherein an informative portion of the probes is
shorter than the probes' full length.

37. The method of Claim 28, wherein the target nucleic acid contains a label,
and the complementary probes are detected by the label on the target nucleic acid.

38. The method of Claim 28, further comprising the step of contacting the
target nucleic acid with a plurality of free oligonucleotides, and

covalently joining a complementary free probe, bound at a site in the target nucleic
acid, to a complementary probe complexed to the discrete particle which is bound to a
site on the target nucleic acid that is adjacent to the site on which the free probe is bound,
and

wherein the detecting step identifies the free probe which is covalently joined to
the probe of the discrete particle.



39. The method of Claim 38, wherein the set of complementary, covalently
joined probes has at least two covalently joined probes which overlap.

40. The method of Claim 38, wherein a sequence of the target nucleic acid is 
compiled in the analyzing step. 


41. The method of Claim 38, wherein the probes complexed to the discrete
particle are identified by the physical property of the discrete particle.

42. The method of Claim 38, further comprising the step of separating the
discrete particles into fractions, wherein the discrete particles are separated on the basis of
the physical property.



43. The method of Claim 42, wherein the discrete particles are separated into
fractions using a flow cytometer.


44. The method of Claim 38, wherein the physical property is associated with a
molecule selected from the group consisting of a dye, a radionucleotide, an EML, and a
fluorescent molecule.

45. The method of Claim 38, wherein the physical property is selected
from the group comprising a size, a charge, an absorbance, and a weight.

46. The method of Claim 45, wherein the physical property is associated with
an intensity of the physical property.



47. The method of Claim 45, wherein the physical property is associated with a 
plurality of different molecules. 

48. The method of Claim 38, wherein an informative portion of the free probes
is shorter than the probes' full length.

49. The method of Claim 38, wherein an informative portion of the probes
complexed to the discrete particles is shorter than the probes' full length.


50. The method of Claim 38, wherein an informative portion of the free probes
and an informative portion of the probes complexed to the discrete particles are shorter
than the probes' full length.

51. A method for analyzing a target nucleic acid, comprising the steps of:

contacting the target nucleic acid with a probe under conditions which allow a
perfect match to be discriminated from a mismatch, wherein an agent is added which
increases the discrimination of the perfect match from the mismatch; and

detecting whether the probe is complementary to the target nucleic acid.


52. The method of Claim 51, wherein the agent is selected from the group
comprising a salt, an organic solvent, a urea, a guanidinium, an amino acid analog, a
polyamine, other positively charged molecules which neutralize the negative charge of the
phosphate backbone, a detergent, a minor/major groove binding agent, a positively
charged polypeptide, and an intercalating agent.



53. The method of Claim 52, wherein the salt is selected from the group
comprising a trialkyl ammonium salt, a sodium chloride, a phosphate salt, and a borate
salt.

54. The method of Claim 52, wherein the organic solvent is selected from the
group comprising a formamide, a glycol, a dimethylsulfoxide, and a dimethylformamide.

55. The method of Claim 52, wherein the amino acid analog is a betaine. 

56. The method of Claim 52, wherein the polyamine is selected from the group
comprising a spermidine and a spermine.

57. The method of Claim 52, wherein the detergent is selected from the group 
comprising a sodium dodecyl sulfate, and a sodium lauryl sarcosinate. 

58. The method of Claim 52, wherein the intercalating agent is selected from
the group comprising an acridine, an ethidium bromide, and an anthracene.



59. The method of Claim 51, wherein a plurality of agents is added. 



60. The method of Claim 59, wherein the agents are selected from the group
comprising a salt, an organic solvent, a urea, a guanidinium, an amino acid analog, a
polyamine, other positively charged molecules which neutralize the negative charge of the
phosphate backbone, a detergent, a minor/major groove binding agent, a positively
charged polypeptide, and an intercalating agent.

61. A method for analyzing a target nucleic acid, comprising the steps of:

providing an array of a plurality of fixed oligonucleotide probes;

providing a plurality of labeled oligonucleotide probes;

contacting the target nucleic acid with the fixed probes and the labeled probes
under conditions that allow the probes which form perfect matches with the target nucleic
acid to be distinguished from the probes bound with a one base mismatch to the target
nucleic acid, wherein an agent is added which increases the discrimination of the perfect
match from the one base pair mismatch;

covalently joining the fixed probe, bound at a site in the target nucleic acid, to the
labeled probe, which is hybridized to a site on the target nucleic acid that is adjacent to
the site on which the fixed probe is bound; and

identifying the fixed probes and the labeled probes that are covalently joined.


62. The method of Claim 61, wherein the agent is selected from the group
comprising a salt, an organic solvent, a urea, a guanidinium, an amino acid analog, a
polyamine, other positively charged molecules which neutralize the negative charge of the
phosphate backbone, a detergent, a minor/major groove binding agent, a positively
charged polypeptide, and an intercalating agent.

63. The method of Claim 61, wherein the agent is formamide. 

64. The method of Claim 61, wherein a plurality of agents is added.